Project Summary


The  stated  goal  of  the  research  was  to  demonstrate  that  robustly  computable  motion  features 
can  be  used  directly  as  a  means  of  detecting  and  recognizing  moving  objects.  Specifically,  the 
goal  was  to  design,  implement,  and  test  a  general  framework  for  detecting  movement  from  a 
moving platform, and recognizing both distributed motion activity on the basis of temporal
texture, and complexly moving, compact objects on the basis of their action. This recognition
approach contrasts with the reconstructive approach that has typified most prior work on motion.
The underlying motivation is the observation that, for objects that typically move, it is frequently
easier to identify them when they are moving than when they are stationary. Specifically, in the
case of temporal texture, we proposed to extract statistical spatial and temporal features from
approximations to the motion field and use techniques analogous to those developed for
gray-scale texture analysis to classify regional activities such as windblown trees, ripples on water, or
chaotic fluid flow, that are characterized by complex, non-rigid motion. For action identification,
we  proposed  to  use  the  spatial  and  temporal  arrangement  of  motion  features  in  conjunction  with 
simple  geometric  image  analysis  to  identify  complexly  moving  objects  such  as  machinery  and 
locomoting  people  and  animals.  The  proposed  work  has  practical  applications  in  monitoring  and 
surveillance,  and  as  a  component  of  a  sophisticated  visual  system. 

By and large, the goals of the project were accomplished. A number of papers describing the
work have appeared in technical journals and conferences, and prototype code implementing the
algorithms, as well as test data, is available on request. A detailed technical description of the
work is contained in three papers that are attached to this report.

The  first  phase  of  the  project  addressed  the  classification  of  temporal  textures  via  statistical 
characteristics  of  the  associated  motion  fields.  We  developed  a  group  of  statistical  measures 
involving first and second order characteristics of the motion field. These measures included
differential quantities such as curl and divergence, and spatial statistics such as directional
co-occurrence features. When incorporated into simple nearest-neighbor classifiers, these measures
proved successful in distinguishing a number of natural temporal textures. Principal component
analysis, carried out in the motion feature space, was used to evaluate the relative effectiveness of
the various measures. This work is described in the paper “Qualitative Recognition of Motion
Using Temporal Texture” attached to this report.
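
The idea of classifying motion fields from differential statistics can be sketched as follows. This is a minimal illustration, not the feature set or classifier used in the project: the two-element feature vector (mean divergence, mean curl), the synthetic expanding and rotating fields, and all function names are assumptions chosen for brevity.

```python
import math


def divergence_and_curl(u, v):
    """Central-difference divergence and curl at every interior pixel.

    u[y][x] and v[y][x] hold the horizontal and vertical flow components.
    """
    height, width = len(u), len(u[0])
    div, curl = [], []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            du_dx = (u[y][x + 1] - u[y][x - 1]) / 2.0
            du_dy = (u[y + 1][x] - u[y - 1][x]) / 2.0
            dv_dx = (v[y][x + 1] - v[y][x - 1]) / 2.0
            dv_dy = (v[y + 1][x] - v[y - 1][x]) / 2.0
            div.append(du_dx + dv_dy)
            curl.append(dv_dx - du_dy)
    return div, curl


def texture_features(u, v):
    """Illustrative feature vector: (mean divergence, mean curl)."""
    div, curl = divergence_and_curl(u, v)
    return (sum(div) / len(div), sum(curl) / len(curl))


def nearest_neighbor(sample, labelled_examples):
    """Return the label of the stored example closest to `sample`."""
    def distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return min(labelled_examples, key=lambda ex: distance(sample, ex[0]))[1]


def synthetic_field(kind, gain, size=7):
    """Hypothetical test fields: pure expansion or pure rotation about the centre."""
    c = size // 2
    if kind == "expanding":
        u = [[gain * (x - c) for x in range(size)] for y in range(size)]
        v = [[gain * (y - c) for x in range(size)] for y in range(size)]
    else:  # "rotating"
        u = [[-gain * (y - c) for x in range(size)] for y in range(size)]
        v = [[gain * (x - c) for x in range(size)] for y in range(size)]
    return u, v


training = [
    (texture_features(*synthetic_field("expanding", 1.0)), "expanding"),
    (texture_features(*synthetic_field("rotating", 1.0)), "rotating"),
]
query = texture_features(*synthetic_field("rotating", 0.8))
print(nearest_neighbor(query, training))
```

A slower rotation is still assigned to the rotating class because its features lie closer to the stored rotating example than to the expanding one. Real temporal textures would need richer statistics than these two means.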

The  second  phase  of  the  work  involved  the  detection,  isolation,  and  tracking  of  periodically 
moving objects. This group includes objects such as walking and running people, running, flying,
or  swimming  animals,  and  some  sorts  of  machinery.  To  human  observers,  many  of  these  objects 
can  be  more  readily  identified  by  their  motion  signatures  than  by  their  shape  -  particularly  in 
low-resolution  or  high-clutter  regimes.  Identification  of  objects  in  this  group  is  also  important  in 
many  practical  applications.  We  developed  a  technique  based  on  the  Fourier  transform  that 
allowed  us  to  flag  and  isolate  periodically  moving  objects  in  real  scenes.  The  method  is  general, 
and applies to a wide variety of situations, including those in which an actor is translating against a
varying background, which cannot be characterized by a simple cyclical image. This work is
described in the attached paper “Detecting Activities”.
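
As a rough illustration of how a Fourier transform can flag periodicity, the sketch below measures how much of a one-dimensional signal's spectral power falls in its strongest frequency bin. It is only an assumed simplification: the method in the attached paper operates on image sequences with a tracked, translating actor, whereas here the input is a single hypothetical signal (for example, a motion measure sampled once per frame), and the function names are illustrative.

```python
import cmath
import math


def power_spectrum(signal):
    """Power at frequencies 1..n//2 (cycles per record) of a mean-removed signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dominant_periodicity(signal):
    """Return (frequency, share of spectral power held by that frequency)."""
    spectrum = power_spectrum(signal)
    total = sum(spectrum)
    if total == 0.0:
        return 0, 0.0  # constant signal: no motion energy at any frequency
    best = max(range(len(spectrum)), key=lambda i: spectrum[i])
    return best + 1, spectrum[best] / total


n = 64
periodic = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
impulse = [1.0] + [0.0] * (n - 1)
print(dominant_periodicity(periodic))
print(dominant_periodicity(impulse))
```

The sinusoid concentrates essentially all of its power in one bin, while the single impulse spreads its power evenly across all bins, so a threshold on the power share separates periodic from non-periodic signals in this toy setting.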

The  final  phase  of  the  work  involved  the  identification  of  periodic  activities  once  they  had 
been  isolated  in  an  image,  e.g.  whether  the  motion  is  produced  by  a  walking  or  a  running  person, 
or  something  else  entirely.  Previous  approaches  to  this  problem  have  relied  upon  analysis  of 
joint  trajectories,  often  obtained  by  attaching  lights  to  the  limbs  of  an  actor.  The  problem  with 
this  approach  is  that  it  is  not  clear  how  to  obtain  the  required  trajectories  from  a  raw  image 
sequence  -  the  joints  must  be  identified  and  tracked.  It  also  is  hard  to  generalize  to  other  motions. 
We  developed  a  method  for  classifying  periodic  movement  based  on  low-level  motion  features. 
Basically,  the  detection  and  isolation  procedures  developed  in  the  previous  phases  of  the  research 
allowed us to define a canonical form for arbitrary periodically moving objects. With the data in
this normalized form, a representation consisting of a spatiotemporal template of local motion
features  could  be  effectively  used  to  classify  a  wide  variety  of  moving  objects.  Since  no  prior 
models of the objects are required, the technique is more general than those based on joint
trajectories. The method was demonstrated on a database of real-world image sequences containing a
variety of movements including running and walking people, people on swings, and mechanical
animals. The technique seems to have sufficient resolution to distinguish, for example, walking
from running, as well as from less similar motions, across multiple actors. It does not have the
resolution  to  reliably  distinguish  individuals  on  the  basis  of  their  gait.  This  work  is  described  in 
the  attached  paper  "Recognizing  Activities". 

We describe a method of visual motion recognition applicable to
a range of naturally occurring motions that are characterized by
spatial and temporal uniformity. The underlying motivation is the
observation that, for objects that typically move, it is frequently
easier to identify them when they are moving than when they are
stationary. Specifically, we show that certain statistical spatial
and temporal features that can be derived from approximations to
the motion field have invariant properties, and can be used to
classify regional activities such as windblown trees, ripples on
water, or chaotic fluid flow, that are characterized by complex,
nonrigid motion. We refer to the technique as temporal texture
analysis in analogy to the techniques developed to classify
gray-scale textures. This recognition approach contrasts with the
reconstructive approach that has typified most prior work on motion.
We demonstrate the technique on a number of real-world image
sequences containing complex movement. The work has practical
application in monitoring and surveillance, and as a component of
a sophisticated visual system.


1.  INTRODUCTION 

Who has not watched ripples spread across a pool and
known water thereby? Or seen leaves shimmer their silver
backs in a summer breeze and known a tree? Who has
not known the butterfly by her fluttering? Or seen a distant
figure walking and known there goes a man? In order
to successfully interact with a dynamic world, an agent
must interpret the activity around it. In the vision system,
this requires the interpretation of visual motion. The
everyday experience of visual motion incorporates a considerable
element of recognition; this may even be its
dominant attribute. Yet surprisingly, this aspect of motion
has been neglected in the literature on computational
motion analysis, which has instead emphasized a reconstructive
approach. We show here that robustly computable
motion features can be used directly as a means of
recognition. In particular, we argue that there exists a
class of image motions, common in scenes of the natural
environment, that are characterized by structural or statistical
self-similarity in space and time. Typical examples
might include ripples on a pool, a flock of birds,
windblown grass or trees, and turbulent weather patterns
in the atmosphere. Such motions, referred to as temporal
textures, can be efficiently identified using statistical pattern
recognition techniques based on invariant features of
the motion field.

Visual motion has, of course, long been considered an
important source of information in natural vision systems.
Many of the (comparatively) unsophisticated systems,
such as those possessed by insects and lower vertebrates,
are essentially blind to anything that is not
moving. Even in the more sophisticated systems possessed
by higher vertebrates, including man, motion in
the visual field retains an important role. Moving objects
in a scene are typically the first attended to, and a wide
variety of (semi)quantitative information relating to object
segmentation, depth, three-dimensional shape, and
object and observer motion, seems to be derived from the
visual motion field.

The potential wealth of derivable information inspired
a large body of work on the computation of exact geometric
quantities such as the 3-D shape of objects, their location,
and the motion of the observer. This reconstruction
problem is sometimes referred to as the structure-from-motion
problem. Research has typically been divided
into two main areas: finding 3-D information from 2-D
projected motion assuming it is available, and determining
projected motion from raw image sequences. Results
have been obtained in both areas; however, the high-level
shape-from-motion algorithms tend to be very sensitive
to the accuracy of the underlying motion information,
and the accuracy of the computed motion information
has typically been low. Consequently, only moderate
success has been achieved in this area.

The emphasis on visual motion as a means of quantitative
reconstruction of world geometry has tended to obscure
the fact that motion can also be used for recognition.
In fact, in biological systems, the use of motion
information for recognition is often more evident than its
use in reconstruction. A simple example occurs in the
case of the common toad Bufo bufo, for which any elongated
object within a certain size range that exhibits motion
along the long axis is identified as a potential food
item, and elicits an orienting response [Ewar87]. Birds
ignore the natural movement of trees in the wind, but
respond immediately to the approach of a predator. More
generally, stylized movements seem to be a universal
form of communication between animals with eyes, from
the aggressive posturing of various fiddler crabs (Uca
species), to the mating dance of the blue-footed booby
(Sula nebouxii), to the expressive facial movements of
baboons.

Humans have a remarkable ability to recognize different
kinds of motion, both of discrete objects, such as
animals or people, and in distributed patterns as in windblown
leaves, or waves on a pond. A classic illustration
of motion recognition by humans is provided by Moving
Light Display experiments where the sole source of information
about a moving actor is provided by lighted points
attached to a few joints [Joha73]. People shown these
images dismiss single frames as meaningless dot patterns
but can recognize characteristic gaits such as running or
walking, and even gender and familiar individuals from
the sequential presentation.

Such abilities suggest that, in the case of machine vision,
it might be possible to use motion as a means of
recognition directly, rather than indirectly through a geometric
reconstruction. In addition to the biological motivations,
there are computational reasons for considering
motion as a recognition modality. One advantage is that
the motion field, insomuch as it can be extracted at all, is
robust with respect to lighting changes, and much more
simply related to shape than is image luminance. Furthermore,
if the task is to find an object that is known to be
moving, motion can be used to efficiently presegment the
scene into regions of high and low interest. This can frequently
be done even if the observer is itself moving
[Nels90].

Recognition can thus be viewed as an alternative, more
qualitative approach to utilizing visual motion. Structure-from-motion
can be viewed as a general transformation
of information in one form (time-varying images) into a
(presumably) more useful form (e.g., depth maps). Recognition,
on the other hand, serves to identify a specific
situation of interest to the system, for instance, the approach
of a fly if you are a frog, or a bird if you are a fly.
A reconstructed world model contains a lot of information,
possibly enough to find a fly if you are a frog, but it
also contains a lot of information that a frog has no interest
in, and that was expensive to obtain.

The above illustrates a central point of the active/behavioral
approach to vision, namely, that in any practical
system, both the information extracted and its representation
must take into account the function of the system.
The primary reason is that the total quantity of information
contained in a visual signal is far greater than any
system needs or can handle. In most proposed applications
of vision, all but a tiny fraction of this information is
irrelevant. The fundamental problem of vision is determining
what image information can be used and extracting
an efficient representation for it. Despite this fact, the
goal of machine vision has often been portrayed as the
problem of devising information transforms that preserve
as much of the original information as possible, albeit in a
purportedly more convenient form, on the grounds that
one never knows what one might need, and that information
once thrown out cannot be recovered. Reconstructionist
approaches in which the goal is to determine “intrinsic
images” representing, for example, the distance,
surface normal, relative velocity, reflectivity, and illumination
for every point in the image, are of this sort. We
believe that such a least commitment strategy is exactly
the wrong approach. Assuming ignorance about a situation
in which considerable structure exists is generally
poor policy, and in the case of vision, it can be disastrous.
The strategy that should be followed is to throw
out as much information as quickly as possible on the
grounds that what is thrown out does not have to be
processed and does not tie up limited computational resources.
This might be termed a strategy of most commitment.
The behavioral approach provides a mechanism
for deciding what can be thrown out via the use of a priori
knowledge about the functionality of the system. Knowing
exactly what information is needed and what it will be
used for also permits the system to alter its interaction
with the world dynamically, in order to make that information
more readily obtainable.

We define qualitative vision as the computation of
iconic image properties (qualities) having a stable relationship
to functional primitives. These functional relations
are the building blocks for visual behavior. The
iconic nature of qualitative primitives provides the necessary
information reduction; only a minute fraction of the
original information is present, but it is directly relevant
to the task at hand. Recognition is a qualitative statement
in this sense. It classifies a situation in terms relevant to
some functionality. In the case of motion, there are a
variety of applications in which robustly computable motion
information can be used for identification directly,
and much more efficiently, than via traditional 3-D reconstruction.

An  example  of  a  directly  useful  motion  feature  is  the 
regional  divergence  of  the  motion  field,  which  can  be 
used to detect approaching objects. In houseflies (Musca
domestica), divergent flow activates a landing reflex when
approaching  a  surface.  We  have  implemented  a  collision 
avoidance  system  based  on  the  divergence  cue  [Nels89]. 
The  basic  idea  is  that  a  region  on  a  collision  course  with 
the  observer  will  be  expanding  and  thus  display  positive 
divergence. We utilized a set of features termed directional
divergences $D_\phi$, parameterized by a polar angle $\phi$
and given by

$$D_\phi = \frac{\partial f_\phi}{\partial r_\phi},$$

where $f_\phi$ is the component of the motion field $\mathbf{f}$, and $r_\phi$ the image coordinate, in the $\phi$
direction.  These  are  equivalent  to  1-D  divergences  along 
various axes, and can be robustly computed from projected
flow information, which is easier to obtain than the
full motion field. The system was used to guide a robot
vehicle between obstacles. Figure 1 shows the divergence
produced by a pair of obstacles toward which the
vehicle is moving.
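
A directional divergence of this kind can be sketched numerically. The code below assumes that the feature is the derivative, taken along the direction at polar angle phi, of the flow component in that same direction, which matches the description of a 1-D divergence along an axis. The finite-difference scheme, the synthetic expanding flow, and the function name are illustrative assumptions, not the implementation of [Nels89].

```python
import math


def directional_divergence(u, v, x, y, phi):
    """Derivative, along direction phi, of the flow component in direction phi.

    u[y][x], v[y][x] are the flow components; central differences at (x, y).
    """
    c, s = math.cos(phi), math.sin(phi)

    def f_phi(px, py):
        # Component of the flow vector along the unit vector (cos phi, sin phi).
        return u[py][px] * c + v[py][px] * s

    d_dx = (f_phi(x + 1, y) - f_phi(x - 1, y)) / 2.0
    d_dy = (f_phi(x, y + 1) - f_phi(x, y - 1)) / 2.0
    # Directional derivative of f_phi along the same unit vector.
    return c * d_dx + s * d_dy


# Hypothetical flow of an approaching surface: uniform expansion about the centre.
size, rate = 5, 0.5
expanding_u = [[rate * (x - 2) for x in range(size)] for y in range(size)]
expanding_v = [[rate * (y - 2) for x in range(size)] for y in range(size)]

for degrees in (0, 45, 90):
    value = directional_divergence(expanding_u, expanding_v, 2, 2, math.radians(degrees))
    print(degrees, value)
```

For a uniformly expanding region the value is positive and the same in every direction, whereas a purely translating region gives zero, which is the property a collision cue needs.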

The divergence is a simple example of a temporal texture,
that is, a regional property that identifies an area as
a certain sort of "stuff" (here stuff that might collide with
you), somewhat as gray-level textures can identify regions
in a static image. Examples of more complex temporal
textures, which would require a combination of
several motion features for classification, include the fluttering
of leaves on a tree; the glitter of sunlight from
distant water and wave motion in nearby water; the motion
within a flock of birds, on top of an anthill, or in a
crowd at a football game; the effect produced by moving
near a fractal object such as a bush; a snowfall; a waterfall;
the turbulent curl of smoke; and the swirl of clouds
in a weather system.

There are a number of potential applications for motion
recognition. One area in which it would be useful is in
automated surveillance. Motion detection via image differencing
can be used for intruder detection; however,
such systems are subject to false alarms, especially in
outdoor environments, since the system is triggered by
anything that moves, whether it be a human, a dog, or a
tree blown by the wind. Motion recognition techniques,
both of the discrete and textural variety, have the potential
to disambiguate motions of different origin. Another
application is in industrial monitoring. Many manufacturing
operations involve a long sequence of simple
operations, each performed repeatedly and at high speed
by a specialized mechanism at a particular location. It
should be possible to set up one or more fixed cameras
that cover the area of interest, and to characterize the
allowed motions in each region of the image(s). Abnormal
activity would violate the prior constraints and allow
the location of a problem to be identified quickly. This
sort of analysis would be particularly valuable for the fast
detection and neutralization of catastrophic failures that
traditional quality control systems might not identify in
time to prevent major damage. A similar situation arises
in security surveillance of a compound, where certain
types of motion may be expected in certain areas and in
certain situations (e.g., the opening of a gate after an
approval signal has been sent) but not in others (e.g., a
man climbing over a wall). Other possibilities include
monitoring satellite imagery for developing storm systems
and crowds for incipient disturbances. General motion
recognition techniques could also be applied to areas
such as gesture recognition [Rhyn86] and handwriting
analysis.


FIG. 1. Obstacle detection via flow field divergence.


2.  BACKGROUND  AND  RELATED  WORK 

Motion recognition in general has received relatively
little attention in the literature. Most computational motion
work, as mentioned previously, has been concerned
with various aspects of the structure-from-motion problem.
There is a large body of psychophysical literature
addressing the perception of motion, most of it concerned
with primitive percepts. A modest amount of this
work addresses more complicated motion recognition issues
[Joha73, Cutt81, Hoff82, Hild87], but the models
and descriptions have typically not been implemented.
Various computational models of temporal structure
have been proposed (e.g., [Chun86, Feld88]), but much of
this work is at a fairly high level of abstraction and has
not actually been applied to visual motion recognition
except in rather artificial tests. Some of the best work in
temporal pattern recognition has actually been done in
the context of speech processing [Juan85, Tank87,
Elma88].

A few studies have considered highly specific aspects
of motion recognition computationally. Pentland [Pent89]
considered lip reading, and implemented a system that
could recognize spoken digits with 70-90% accuracy
over five speakers. The system required the location of
the lips to be entered by hand, and depended on an explicitly
constructed lip model. Rashid [Rash80, Godd89]
considered the computational interpretation of moving
light displays, particularly in the context of gait determination.
This work emphasized rather high-level symbolic
models of temporal sequences, an approach made possible
by the discrete nature of the moving light displays.
The results were quite sensitive to discrete errors and
thus highly dependent on the ability to solve the correspondence
problem and accurately track joint and limb
positions. This severely limits the general applicability of
the method. Anderson et al. [Ande85] describe a method
of change detection for surveillance applications based
on the spectral energy in a temporal difference image.
This has the flavor of the temporal texture analysis described
here, but was not generalized to other motion
features or more sophisticated recognition.

Some of the work done in the context of the structure-from-motion
problem, particularly the methods that have
been developed to obtain local motion information from
image sequences, is relevant to temporal texture. Although
we make somewhat different use of the information,
this work has motivated and provided foundations
for our approach, and it is thus appropriate to review the
field.

A camera moving within a three-dimensional environment
produces a time-varying image that can be characterized
at any time $t$ by a two-dimensional vector-valued
function $\mathbf{f}$ known as the motion field. The motion field
describes the two-dimensional projection of the three-dimensional
motion of scene points relative to the camera.
Mathematically, the motion field is defined as follows.
For any point $(x, y)$ in the image, there corresponds
at time $t$ a three-dimensional scene point $(x', y', z')$
whose projection it is. At time $t + \Delta t$, the world point
$(x', y', z')$ projects to the image point $(x + \Delta x, y + \Delta y)$. The
flow field at $(x, y)$ at time $t$ is given by

$$\mathbf{f}(x, y) = \lim_{\Delta t \to 0} \left( \frac{\Delta x}{\Delta t}, \frac{\Delta y}{\Delta t} \right).$$

The motion field depends on the motion of the camera,
the three-dimensional structure of the environment, and
the three-dimensional motion (if any) of objects in the
environment. If all these components are known, then it
is relatively straightforward to calculate the motion field.
In the traditional approach to motion analysis, the question
has been whether the process can be inverted to
obtain information about camera motion and structure of
the environment. This is the basis of the structure-from-motion
problem. The solution is not easy, and if arbitrary
shapes and motions are permitted in the environment,
there may not be a unique solution. However, it can be
mathematically demonstrated that, in many situations, a
unique solution exists.
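
The forward calculation, from known geometry and motion to the motion field, can be sketched directly from the definition. The example assumes a pinhole camera with unit focal length, which the text does not specify, and approximates the limit with a small finite time step. The function names and sample points are hypothetical.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a camera-frame scene point (X, Y, Z), with Z > 0."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def motion_field_vector(point, velocity, dt=1e-6, focal_length=1.0):
    """Approximate the motion field at the image of `point`.

    `velocity` is the scene point's 3-D velocity relative to the camera.
    The limit in the definition is approximated with a small finite `dt`.
    """
    x0, y0 = project(point, focal_length)
    moved = tuple(p + v * dt for p, v in zip(point, velocity))
    x1, y1 = project(moved, focal_length)
    return ((x1 - x0) / dt, (y1 - y0) / dt)


# Camera advancing along its optical axis at unit speed: every static scene
# point has relative velocity (0, 0, -1). Nearer points move faster in the image.
for depth in (2.0, 4.0, 8.0):
    print(depth, motion_field_vector((1.0, 2.0, depth), (0.0, 0.0, -1.0)))
```

The printed vectors shrink as depth grows, which is the dependence on scene structure that structure-from-motion methods attempt to invert.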

The existence of such solutions has inspired a large
body of work on the mathematical theory of extracting
shape and/or motion information from the motion field.
There have been two basic approaches to the problem.
The first utilizes point correspondences in one or more
images, generally under the assumption of environmental
rigidity [Ullm79, Tsai81]. This is equivalent to knowing
the motion field at isolated points of the image. Several
authors have obtained closed form solutions to the shape-from-motion
problem in this formulation, obtaining a set
of linearized equations [Long81, Tsai84]. The second approach
uses information about the flow and its derivatives
in a local neighborhood under some assumption
about the structure of environmental surfaces (e.g., they
are planar) [Praz81, Boll87, Waxm87]. In this case, the
end result is a set of equations relating the flow field
derivatives to the camera motion and the three-dimensional
structure of the environment. Most of these studies,
however, have started with the assumption that detailed
and accurate information, either in the form of
point correspondences or dense motion fields, is available.
Unfortunately, the solutions to the equations are
frequently inordinately sensitive to small errors in the
motion field. In the case of point correspondences, Tsai
and Huang [Tsai84] report 60% error for a 1% perturbation
in input for some instances using their method. This
error sensitivity is due both to inherent ambiguities in the
motion fields produced by certain camera motions, at
least over restricted fields of view, and (in the second
approach) to the reliance on differentiation of the flow
field, which amplifies the effect of any error present in the
data.
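
The amplifying effect of differentiation can be seen in a small numerical experiment. The sketch below is illustrative only: the linear profile, the alternating perturbation, and the sample spacing are assumed values, and the forward difference stands in for whatever derivative estimator a particular method uses.

```python
def forward_difference(samples, spacing):
    """First-order finite-difference estimate of the derivative."""
    return [(samples[i + 1] - samples[i]) / spacing
            for i in range(len(samples) - 1)]


spacing = 0.01       # sample spacing (hypothetical units)
epsilon = 0.001      # perturbation amplitude: 0.1% of the signal's range
count = 101

clean = [i * spacing for i in range(count)]                  # f(x) = x on [0, 1]
noisy = [value + epsilon * (-1) ** i for i, value in enumerate(clean)]

value_error = max(abs(a - b) for a, b in zip(noisy, clean))
slope_error = max(abs(d - 1.0) for d in forward_difference(noisy, spacing))
print(value_error, slope_error)
```

A perturbation of 0.001 in the sampled values produces an error of about 0.2 in the estimated slope, whose true value is 1. The error grows as the perturbation amplitude divided by the sample spacing, so finer sampling makes the derivative estimate worse unless the data are smoothed first.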

The two approaches to obtaining shape from motion
utilize somewhat different methods for extracting motion
information from image sequences. The methods using
point correspondences rely on matching techniques similar
to those employed in stereo vision [Moro79, Marr79,
Barn80]. This process is well known to be difficult since
features may change from one image to the next, and
even appear and disappear completely.

Techniques for computing dense motion fields have relied
heavily on differential methods, which attempt to
determine the motion field from local computations of the
spatial and temporal derivatives of the gray-scale image.
The first derivative methods originally proposed by Horn
and Schunck [Horn81] must deal with what is known as
the aperture problem, which refers to the fact that only
the component of optical flow parallel to the local image
gradient can be recovered from first-order differential information.
Intuitively, the aperture problem corresponds
to the fact that for a moving edge, only the component of
motion perpendicular to the edge can be determined.
This effect is responsible for the illusion of upward motion
produced by the rotating spirals of a barber pole
where either vertical or horizontal motion could produce
the local motion of the edges, and the eye chooses the
wrong one. In order to determine both components of the
flow field vector, information must be combined over regions
large enough to encompass significant variations in
the gradient direction. The most common method of doing
this involves some form of regularization [Horn81,
Anan95, Nage86]; however, such methods often result in
blurring of motion discontinuities. A nonblurring method
known as constraint line clustering has been proposed by
Schunck [Schu84]. Techniques using higher order derivatives
to avoid the aperture problem have been proposed
[Nage83, Uras88]; however, these suffer from stability
problems due to multiple differentiation and typically require
extensive smoothing to produce clean results.
Other methods include spatiotemporal energy methods
[Heeg87], Fourier methods based on phase correlation
[Burt89], and direct correlation of image patches
[Barn80, Litt88]. Recent work by Anandan [Anan89] provides
a common framework into which many of these
methods can be incorporated.
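
The aperture problem can be made concrete with the standard first-order brightness constancy constraint, Ix*u + Iy*v + It = 0, where Ix, Iy, and It are the spatial and temporal intensity derivatives. The sketch below assumes that constraint and uses hypothetical derivative values; it shows that a single constraint yields only the flow component along the gradient.

```python
def normal_flow(ix, iy, it):
    """Flow component along the image gradient implied by brightness constancy.

    ix, iy are spatial derivatives of image intensity; it is the temporal one.
    """
    grad_sq = ix * ix + iy * iy
    if grad_sq == 0.0:
        return (0.0, 0.0)  # no gradient: nothing can be recovered locally
    scale = -it / grad_sq
    return (scale * ix, scale * iy)


def temporal_derivative(ix, iy, flow):
    """Temporal derivative predicted by Ix*u + Iy*v + It = 0 for a given flow."""
    u, v = flow
    return -(ix * u + iy * v)


# A vertical edge (gradient along x) translating with true flow (2, 3).
it = temporal_derivative(1.0, 0.0, (2.0, 3.0))
print(normal_flow(1.0, 0.0, it))
```

Only the horizontal part of the true flow is recovered. Any other vertical speed along the edge would give the same temporal derivative and therefore the same answer, which is why information has to be pooled over regions where the gradient direction varies.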

A potential problem with most of the above approaches
is the assumption that the motion field manifests itself
locally as a rigid 2-D motion of an image patch. Unfortunately,
the local apparent motion of the image, known as
the optical flow, does not necessarily correspond to the
2-D motion field. The most obvious demonstrations are
pathological examples. For instance, a spinning, featureless
sphere under constant illumination has zero optical
flow, but a nonzero motion field. Conversely, a stationary
sphere under changing illumination has nonzero optical
flow, but zero motion field. Image patches also undergo
various nonrigid deformations such as expansion
and skewing. Verri and Poggio [Verr87] have shown that
only under special conditions of lighting and movement
do the motion field and the optical flow correspond exactly.
They also show, however, that for sufficiently high
gradient magnitude, the agreement can be made arbitrarily
close. This corresponds to the intuition that for
strongly textured images the motion field and the optical
flow are approximately equal. A few authors have attempted
to explicitly include some of these effects (e.g.,
[Burt89]), but it is not clear that any great advantage has
been obtained thereby.

On the whole, despite a great deal of effort expended in
devising  flow  invariants,  regularization  methods,  and 
matching  techniques,  neither  correspondence  nor  flow 
field  methods  have  yielded  data  sufficiently  accurate  to 
allow  the  theoretical  structure-from-motion  results  to  be 
reliably applied. Adiv [Adiv85] argues that inherent near
ambiguities  in  the  3-D  structure-from-motion  problem 
may  make  the  goal  of  extracting  information  sufficiently 
precise  to  allow  uniform  application  of  the  theoretical 
solutions  unattainable  in  practice.  Verri  and  Poggio 
[Verr87]  make  essentially  the  same  point,  arguing  that 
the  disagreement  between  the  motion  field  and  the  optical 
flow  makes  the  computation  of  sufficiently  accurate 
quantitative  values  impractical. 

An  alternative  is  to  devise  qualitative  applications  that 
can make use of inaccurate flow information [Thom86,
Nels88, Nels89]. The motion recognition strategies proposed
here represent one such application. Many of the
motion  features  proposed  in  the  next  section  are  qualita¬ 
tive  in  the  sense  that  their  detection  does  not  rely  on 
highly  accurate  measurements  of  the  motion  field.  In 
fact,  useful  motion  features  can  be  obtained  from  partial 
information  such  as  the  projected  flow  computed  in  the 
first  step  of  the  Horn  and  Schunck  procedure,  for  exam¬ 
ple,  the  directional  divergence  used  for  obstacle  avoid¬ 
ance  in  [Nels89].  To  reiterate,  our  idea  is  to  use  motion 
information  for  identification  directly,  rather  than  pro¬ 
ceeding  indirectly,  through  the  reconstruction  of  an  ana¬ 
log  3-D  world  model. 

3.  TEMPORAL  TEXTURE 

Classical  gray-level  texture  analysis  is  concerned  with 
the  identification  of  spatial  invariances  in  the  gray-level 
patterns  in  an  image  region.  These  invariances  may  be 
either  structurally  or  statistically  defined.  The  basic  idea 
is  to  characterize  different  sorts  of  "stuff"  of  indetermi¬ 
nate  spatial  extent  in  terms  of  such  invariances.  In  this 
article  we  extend  this  basic  idea  into  the  temporal  dimen¬ 
sion  with  the  idea  of  recognizing  similar  stuff  in  dynamic 
scenes.  This  is  motivated  in  part  by  the  existence  of  a 
large  class  of  natural  phenomena  that  seem  to  have  char¬ 
acteristic  motions,  but  indeterminate  spatial  extent.  Ex¬ 
amples  include  windblown  trees  or  grass,  turbulent  flow 
in  cloud  patterns,  ripples  on  water,  falling  snow,  and  the 
motion  of  a  flock  of  birds  or  a  crowd  of  people.  The 
motion in a temporal texture is distinct from that in patterns
such as walking and cycling, which involve structure
at a single location.

Temporal  texture  could  be  analyzed  directly  as  a  three- 
dimensional  signal  using  generalizations  of  the  tech¬ 
niques  applied  to  two-dimensional  fields.  However,  since 
most  changes  along  the  time  dimension  are  due  to  motion 
in  the  image,  it  makes  sense  to  preprocess  the  time-vary¬ 
ing  image  to  obtain  motion  information,  as  it  is  in  object 
motion  that  the  physical  invariances  lie.  In  this  case,  a 
natural choice is the optic flow field. The basic source of
information  is  thus  a  time-varying  vector  field  represent¬ 
ing  an  approximation  to  the  two-dimensional  motion  field 
induced  by  movement  in  the  world.  Such  a  field  contains 
considerably more information than the scalar-valued
field  associated  with  gray-level  texture  analysis.  In  addi¬ 
tion,  the  direction  and  magnitude  of  motion  have  a  more 
direct  relationship  to  typically  salient  events  in  the  world 
than  the  gray  level  of  a  single  pixel.  Consequently,  cer¬ 
tain  types  of  recognition  might  be  expected  to  be  easier. 
For  example,  in  the  right  context,  fast  downward  motion 
could  be  taken  as  evidence  of  a  falling  object.  It  is  diffi¬ 
cult  to  envisage  making  any  similar  statement  about  (say) 
gray level 147. A problem with using optic flow is that it is
difficult to compute accurately. One solution is to devise
measures  that  are  insensitive  to  inaccuracy.  Another  is  to 
utilize  partial  information.  An  example  is  the  gradient 
parallel  component  of  the  optic  flow,  which  is  simpler  to 
compute  locally  from  an  image  sequence  than  the  full 
motion  field. 
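The gradient parallel component mentioned above can be sketched directly from two consecutive frames. The code below is a minimal illustration under our own assumptions (central-difference gradients, a simple frame difference for the temporal derivative, and a hypothetical `min_grad` threshold); it is not the multiresolution computation used in the experiments.

```python
def normal_flow(frame0, frame1, min_grad=1e-3):
    """Estimate the gradient-parallel (normal) flow between two frames.

    Frames are lists of rows of floats. Returns a dict mapping (row, col)
    to a (vx, vy) vector; pixels with a weak gradient are skipped because
    the estimate is unreliable there.
    """
    rows, cols = len(frame0), len(frame0[0])
    flow = {}
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Spatial derivatives by central differences (an assumption).
            ix = (frame0[r][c + 1] - frame0[r][c - 1]) / 2.0
            iy = (frame0[r + 1][c] - frame0[r - 1][c]) / 2.0
            # Temporal derivative by frame differencing.
            it = frame1[r][c] - frame0[r][c]
            g2 = ix * ix + iy * iy
            if g2 < min_grad ** 2:
                continue
            # Normal flow: -It / |grad I| along the unit gradient direction.
            scale = -it / g2
            flow[(r, c)] = (scale * ix, scale * iy)
    return flow


# A horizontal intensity ramp shifted one pixel to the right.
ramp0 = [[float(c) for c in range(5)] for _ in range(5)]
ramp1 = [[float(c - 1) for c in range(5)] for _ in range(5)]
print(normal_flow(ramp0, ramp1)[(2, 2)])
```

Note that the same pair of frames would result if the ramp had moved diagonally; only the component along the gradient is recovered, which is the aperture problem in miniature.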

Despite  the  differences  in  domain,  some  techniques  of 
spatial  texture  analysis  are  applicable  to  temporal  tex¬ 
tures.  Spatial  texture  analysis  is  traditionally  performed 
using  either  statistical  or  syntactic  methods.  Statistical 
methods  utilize  measures  of  local  features  that  are  ex¬ 
pected  to  be  similar  within  patches  of  the  same  texture. 
Examples of measurements that have been used include
gray-level cooccurrence matrices [Hara73, Conn80],
Fourier power spectra [Bajc76, Chen82], and average
magnitude response of filter masks [Laws80, Mali89].
There are also several methods based on estimating
parameters for a description of a region in terms of some
texture model. Examples include autoregressive models
[Kash82] and Markov random fields [Kane82]. Syntactic
approaches  are  most  appropriate  for  highly  regular  tex¬ 
tures  and  involve  analyzing  the  geometric  arrangement  of 
primitive  structural  elements.  In  the  case  of  natural  tem¬ 
poral  textures,  techniques  similar  to  the  statistical  gray- 
level  methods  seem  most  appropriate,  and  most  of  the 
features  described  in  this  article  are  of  this  type.  As  with 
spatial  textures,  the  main  criteria  for  selecting  features 
are  that  they  change  little  within  a  given  texture  (i.e.,  an 
area of the same stuff), and that they vary significantly
between  different  textures. 

The  dimensionality  of  the  vector-valued  flow  field  and 


the  fact  that  measures  can  be  made  in  both  space  and 
time  allow  considerable  latitude  in  designing  features. 
Since  textures  are  characterized  by  statistical  regularities 
in  the  occurrence  of  local  structure,  extraction  of  fea¬ 
tures  useful  for  classification  generally  involves  at  least 
two  tiers  of  processing:  A  local  feature  extraction  stage, 
and  (at  least  one)  spatially  or  temporally  extended  inte¬ 
gration  stage.  Local  features  can  be  any  useful  quantity 
that  can  be  associated  with  a  point  in  the  image.  Exam¬ 
ples  include  flow  magnitude  and  direction,  differential 
measures  such  as  divergence  and  curl,  and  local  uniform¬ 
ity  measures.  The  spatiotemporal  motion  energy  filters 
introduced  by  Heeger  [Heeg87]  could  also  provide  useful 
measures  in  this  context.  Typically  these  are  expected  to 
vary  within  a  texture,  thus  necessitating  the  integration 
phase.  Extended  measures  are  most  frequently  based  on 
quantities  such  as  means  or  variances,  but  other  ex¬ 
tended  measures,  such  as  Fourier  coefficients  and  cooc¬ 
currence  statistics,  can  be  used.  The  most  typical  struc¬ 
ture  for  a  temporal  texture  feature  involves  extended 
spatial  or  temporal  (or  both)  measures  of  spatiotemporal 
microfeatures. Features can also be derived from extended
spatial measures of extended temporal features
and vice versa.

In  order  to  simplify  the  motion  preprocessing,  we  con¬ 
sidered  features  based  on  the  gradient  parallel  compo¬ 
nent  of  the  motion  field,  also  referred  to  as  the  normal 
flow.  The  simplest  local  motion  measures  are  the  magni¬ 
tude  and  direction  of  the  normal  flow.  We  examine  sev¬ 
eral  statistical  features  based  on  the  distribution  of  these 
first-order  quantities.  The  direction  and  magnitude  can  be 
combined  locally,  both  spatially  and  temporally,  to  ob¬ 
tain  second-order  local  motion  measures.  We  also  exam¬ 
ine  features  based  on  the  distribution  of  some  second 
order  measures.  All  these  are  described  below. 

A  useful  statistic  based  on  the  distribution  of  the  nor¬ 
mal  flow  magnitude  is  the  average  flow  magnitude  di¬ 
vided  by  its  standard  deviation.  The  scaling  by  the  stan¬ 
dard  deviation  has  the  effect  of  making  the  measure 
robust  under  scaling  changes.  One  way  to  think  of  this 
statistic is as a measure of "peakiness" in the velocity
distribution.  It  is  invariant  under  translation,  rotation, 
and  temporal  and  spatial  scaling. 
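The magnitude statistic described above can be sketched as follows. The function name is hypothetical, and the use of the population (rather than sample) standard deviation is our assumption, since the text does not specify which was used.

```python
import math


def magnitude_peakiness(flow_vectors):
    """Mean flow magnitude divided by its standard deviation."""
    mags = [math.hypot(vx, vy) for vx, vy in flow_vectors]
    mean = sum(mags) / len(mags)
    # Population variance (an assumption; the text does not say which).
    variance = sum((m - mean) ** 2 for m in mags) / len(mags)
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("all magnitudes are equal; the ratio is undefined")
    return mean / std


vectors = [(1.0, 0.0), (0.0, 3.0)]
print(magnitude_peakiness(vectors))                               # magnitudes 1 and 3
print(magnitude_peakiness([(2 * x, 2 * y) for x, y in vectors]))  # rescaled flow
```

Multiplying every flow vector by the same factor scales the mean and the standard deviation equally, so the ratio is unchanged, which is the scale invariance claimed in the text.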

We  also  considered  statistics  of  second-order  flow 
magnitude  features,  namely,  estimates  of  the  divergence 
and  curl  of  the  motion  field  obtained  from  the  normal 
flow.  Positive  and  negative  divergence  and  positive  and 
negative  curl  were  taken  as  separate  features  to  give  four 
different  second-order  features.  The  features  used  are  the 
mean values of these quantities over the region of interest.
They are invariant with respect to rotation and translation,
but not scaling. If scale invariant features are desired,
ratios of the differential measures can be used.
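A rough sketch of the four differential features follows. For simplicity it assumes a dense flow field given as two arrays `u` and `v` (hypothetical names) and central differences, whereas the text estimates these quantities from the normal flow. Treating the positive and negative parts as separate averages over all interior pixels is also our reading of the description.

```python
def div_curl_features(u, v):
    """Mean positive/negative divergence and curl of a flow field.

    u and v are lists of rows holding the x and y flow components,
    with the row index taken as the y coordinate.
    """
    rows, cols = len(u), len(u[0])
    sums = {"pos_div": 0.0, "neg_div": 0.0, "pos_curl": 0.0, "neg_curl": 0.0}
    count = 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            du_dx = (u[r][c + 1] - u[r][c - 1]) / 2.0
            du_dy = (u[r + 1][c] - u[r - 1][c]) / 2.0
            dv_dx = (v[r][c + 1] - v[r][c - 1]) / 2.0
            dv_dy = (v[r + 1][c] - v[r - 1][c]) / 2.0
            div = du_dx + dv_dy
            curl = dv_dx - du_dy
            # Positive and negative parts are accumulated separately.
            sums["pos_div"] += max(div, 0.0)
            sums["neg_div"] += min(div, 0.0)
            sums["pos_curl"] += max(curl, 0.0)
            sums["neg_curl"] += min(curl, 0.0)
            count += 1
    if count == 0:
        raise ValueError("field is too small to differentiate")
    return {name: total / count for name, total in sums.items()}


# A uniformly expanding field centered on the middle pixel.
u = [[c - 2.0 for c in range(5)] for _ in range(5)]
v = [[r - 2.0 for _ in range(5)] for r in range(5)]
print(div_curl_features(u, v))
```

Doubling the flow doubles every one of these means, which is why the text suggests ratios of the differential measures when scale invariance is required.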

A useful first-order statistic can be derived from the
distribution of flow directions. Intuitively, what is being
measured is the nonuniformity in direction of motion.
Our nonuniformity statistic was computed by discretizing
the direction into eight possible values, computing a
histogram over the relevant neighborhood of the image,
and summing the absolute deviation from a uniform
distribution. It should be noted that the normal flow
direction at a pixel is parallel (or antiparallel) to the gradient
direction. Thus measures based on the normal flow
direction alone depend on the underlying intensity texture. To
reduce this dependence, the normal flow directions in the
histogram are normalized by the four-way histogram of
gradient directions. This feature is invariant under
translation, rotation, and temporal and spatial scaling.
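The core of the nonuniformity statistic can be sketched as below. This is only the histogram step: the normalization by the four-way histogram of gradient directions described above is deliberately omitted, and the function name and the rounding used to assign directions to bins are our own assumptions.

```python
import math


def direction_nonuniformity(flow_vectors, bins=8):
    """Sum of absolute deviations of the direction histogram from uniform."""
    counts = [0] * bins
    sector = 2 * math.pi / bins
    for vx, vy in flow_vectors:
        if vx == 0 and vy == 0:
            continue  # direction is undefined for zero flow
        angle = math.atan2(vy, vx) % (2 * math.pi)
        # Assign to the nearest of the `bins` discrete directions.
        counts[int(round(angle / sector)) % bins] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no nonzero flow vectors")
    return sum(abs(n / total - 1 / bins) for n in counts)


# One vector in each of the eight directions, then five identical vectors.
spread = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]
print(direction_nonuniformity(spread))
print(direction_nonuniformity([(1.0, 0.0)] * 5))
```

The statistic is zero for a perfectly uniform spread of directions and reaches its maximum, 2(1 - 1/8) = 1.75 for eight bins, when all motion falls in a single direction.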

Second-order  measures  of  the  normal  flow  direction 
distribution  can  be  derived  from  the  difference  statistics, 
which  give  the  number  of  pixel  pairs  at  a  given  offset  that 
differ  in  their  values  by  a  given  amount.  These  difference 
statistics  can  be  represented  by  a  cooccurrence  matrix  of 
the  normal  flow  direction  surrounding  a  pixel.  Cooccur¬ 
rence  matrices  are  computed  for  four  directions  (horizon¬ 
tal,  vertical,  positive  diagonal,  and  negative  diagonal)  at  a 
distance  proportional  to  the  average  flow  magnitude. 
This  yields  invariance  with  respect  to  scaling.  In  each 
direction  the  ratio  of  the  number  of  pixel  pairs  differing  in 
direction  by  at  most  one  to  the  number  of  pixel  pairs 
differing by more than one is computed. This is the ratio of
the sum of the first two difference statistics to the sum of
the last three difference statistics. Logarithms of the
resulting ratios are used as a feature in each of the four
directions, and represent a measure of the spatial homogeneity
of  the  flow.  These  features  are  invariant  under  transla¬ 
tion,  rotation,  and  scaling. 
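The difference-statistic feature for a single offset direction can be sketched as follows. Directions are assumed to be already quantized to integers 0 to 7, with `None` marking pixels that have no flow estimate. The add-one smoothing that guards against an empty denominator is our own addition and is not described in the text; the offset is passed in directly rather than derived from the average flow magnitude.

```python
import math


def direction_difference_feature(dirs, offset, bins=8):
    """Log ratio of similar to dissimilar direction pairs at a given offset.

    dirs is a list of rows of quantized directions in range(bins), or None.
    offset is a (row step, column step) pair, e.g. (0, 1) for horizontal.
    """
    rows, cols = len(dirs), len(dirs[0])
    d_row, d_col = offset
    stats = [0] * (bins // 2 + 1)  # circular differences 0..4 for 8 bins
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if not (0 <= r2 < rows and 0 <= c2 < cols):
                continue
            a, b = dirs[r][c], dirs[r2][c2]
            if a is None or b is None:
                continue
            diff = abs(a - b) % bins
            stats[min(diff, bins - diff)] += 1  # directions wrap around
    similar = stats[0] + stats[1]   # first two difference statistics
    dissimilar = sum(stats[2:])     # last three difference statistics
    # Add-one smoothing is an assumption made to avoid division by zero.
    return math.log((similar + 1) / (dissimilar + 1))


uniform = [[2, 2, 2] for _ in range(3)]
stripes = [[0, 4, 0] for _ in range(3)]
print(direction_difference_feature(uniform, (0, 1)))
print(direction_difference_feature(stripes, (0, 1)))
```

A spatially homogeneous flow gives a large positive value, while a flow whose direction flips between neighbors gives a negative one. Calling the function with offsets such as (0, d), (d, 0), (d, d), and (d, -d) gives the four directional features.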


4.  EXPERIMENTAL  RESULTS 

A  set  of  image  sequences  representing  both  oriented 
temporal  textures  such  as  flowing  water  and  nonoriented 
textures  such  as  leaves  fluttering  in  the  wind  was  digi¬ 
tized.  In  addition,  sequences  representing  uniform  ex¬ 
pansion  and  rotation  of  a  textured  scene  were  obtained. 
These  were  used  in  classification  experiments  utilizing 
the  features  described  above.  Seven  different  texture 
samples,  listed  below,  were  used  for  the  experiments: 

A.  fluttering  crepe  paper  bands 

B.  cloth  waving  in  the  wind 

C.  motion  of  tree  in  the  wind 

D.  flow  of  water  in  a  river 

E.  turbulent  motion  of  water 

F.  uniformly  expanding  image  produced  by  forward 
observer  motion 


G. uniformly rotating image produced by observer
roll. 

Representative examples of scenes and derived flow are
illustrated in Figs. 2A-2E. Figure 3 illustrates the temporal
dimension for two of the cases, showing a horizontal
slice  through  the  spatiotemporal  solid.  The  temporal  axis 
runs  vertically. 

For each sample texture, two image sequences consisting
of 16 256 x 256 pixel frames taken at 30 Hz were split
into quadrants to obtain eight independent sample image
sequences of 128 x 128 pixels. The normal flow field was
computed between each consecutive pair of image
frames using a multiresolution flow computation, with
the direction of normal flow quantized to one of eight
directions. The end result of the processing was a sample
of eight normal flow sequences of 15 frames each for each
texture.
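The sample preparation described above amounts to simple bookkeeping, sketched here with hypothetical helper names. Two sequences split into four quadrants each give eight samples, and 16 frames give 15 consecutive pairs.

```python
def split_quadrants(sequence):
    """Split every frame of a sequence into four quadrant sub-sequences."""
    rows, cols = len(sequence[0]), len(sequence[0][0])
    half_r, half_c = rows // 2, cols // 2
    quadrants = []
    for r0 in (0, half_r):
        for c0 in (0, half_c):
            quadrants.append(
                [[row[c0:c0 + half_c] for row in frame[r0:r0 + half_r]]
                 for frame in sequence]
            )
    return quadrants


def consecutive_pairs(sequence):
    """Frame pairs between which a flow field would be computed."""
    return list(zip(sequence, sequence[1:]))


# A toy sequence of 16 frames of 4 x 4 pixels stands in for 256 x 256 frames.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
sequence = [frame] * 16
samples = split_quadrants(sequence)
print(len(samples), len(consecutive_pairs(samples[0])))
```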

Classification  experiments  were  run  using  a  nearest 
centroid  classifier.  More  elaborate  classifiers  could  be 
used,  but  the  nearest  centroid  method  gives  a  fairly  direct 
indication  of  the  utility  of  the  features.  The  features  used 
were  those  described  in  the  previous  section,  namely 

a.  mean  flow  magnitude  divided  by  standard  deviation 

b.  positive  and  negative  curl  and  divergence  estimates 

c.  nonuniformity  of  flow  direction 

d.  directional  difference  statistics  in  four  directions. 

Normalization constants were computed so that the ensemble
mean for each feature was 1. No more sophisticated
normalization procedure was found necessary.

The  first  four  samples  of  each  texture  are  used  as  a 
training  set  to  compute  the  centroid  of  the  cluster  corre¬ 
sponding  to  that  texture  in  the  feature  space.  The  differ¬ 
ent  feature  values  are  converted  into  common  units  by 
mapping  the  average  of  the  resulting  centroids  to  a  unit 
vector. Table 1 contains the values of these features for
each flow sample. It can be seen that, overall, the within
sample variation is smaller than the between sample
variation as desired. No single feature is sufficient to
distinguish all the textures, but for each texture, there is at
least one feature that clearly separates it from the others.
For example, as would be expected, texture F, containing
an approaching object, is distinguished by high divergence.
For texture A, containing moving vertical bands,
the second-order difference feature in the vertical
direction clearly separates it from the rest.
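The training and classification procedure can be sketched as a nearest centroid classifier with per-feature scaling. The function names and the two-feature toy data below are hypothetical and are not values from Table 1; dividing each feature by its ensemble mean is our reading of the normalization described above, and it assumes no feature has zero mean.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(training):
    """training maps a texture label to its list of feature vectors."""
    everything = [v for vectors in training.values() for v in vectors]
    scale = centroid(everything)  # ensemble mean of each feature
    centroids = {
        label: centroid([[x / s for x, s in zip(v, scale)] for v in vectors])
        for label, vectors in training.items()
    }
    return scale, centroids


def classify(vector, scale, centroids):
    """Return the label of the nearest centroid in scaled feature space."""
    scaled = [x / s for x, s in zip(vector, scale)]
    return min(centroids, key=lambda label: math.dist(scaled, centroids[label]))


# Hypothetical two-feature training data for two textures.
training = {
    "expanding": [[2.0, 0.1], [1.8, 0.2]],
    "rotating": [[0.1, 2.0], [0.2, 1.8]],
}
scale, centroids = train(training)
print(classify([1.9, 0.3], scale, centroids))
```

Restricting the vectors to a subset of the features before training is enough to reproduce the kind of feature-combination comparison reported in the experiments.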

The  remaining  four  samples  are  tested  using  a  nearest 
centroid  classification  scheme.  The  results  of  classifica¬ 
tion  are  summarized  in  Table  2.  Note  that  none  of  the 
features  alone  is  sufficient  to  separate  all  the  textures,  but 
the combination gives 100% success in the classification

TABLE 1
Sample Features

                a       b                               c       d
                        Pos     Neg     Pos     Neg                             Pos     Neg
Texture         Mag     div     div     curl    curl    Dir     Hor     Vert    diag    diag

A: Bands        1.083   0.591   -0.698  0.275   -0.199  0.488   5.013   9.084   5.192   5.178
                1.009   9.640   -0.570  0.240   -0.223  0.663   4.679   8.356   4.%2    5.055
                1.081   0.467   -0.625  0.212   -0.209  0.837   4.358   7.878   4.518   4.409
                1.221   0.544   -0.548  0.188   -0.203  0.694   5.319   8.954   5.540   5.4.'2

B: Cloth        1.417   0.648   -0.620  0.314   -0.322  0.928   4.265   6.170   4.806   4.289
                1.529   0.530   -0.557  0.323   -0.335  0.917   4.939   5.681   6.149   4.826
                1.282   0.610   -0.597  0.317   -0.308  0.942   3.390   4.972   3.882   3.338
                1.393   0.610   -0.647  0.337   -0.342  0.902   3.5%    4.732   4.276   3.563

C: Plant        0.964   0.708   -0.216  0.196   -0.297  0.947   1.481   2.103   2.276   1.466
                1.064   0.306   -0.434  0.263   -0.176  0.952   1.556   2.287   2.353   1.574
                0.882   0.527   -0.436  0.258   -0.279  0.968   1.262   1.868   2.055   1.239
                0.951   0.386   -0.392  0.294   -0.264  0.970   1.300   1.871   2.053   1.243

D: Water        1.293   0.446   -0.550  0.l9t   -0.161  0.864   4.637   5.148   7.154   4.792
                1.494   0.486   -0.382  0.171   -0.187  0.814   5.025   5.617   7.038   5.110
                1.258   0.517   -0.585  0.206   -0.186  0.885   4.297   4.777   6.218   4.505
                1.512   0.448   -0.528  0.222   -0.225  0.887   3.869   4.176   6.073   3.876

E: Turbulence   1.123   0.728   -0.637  0.400   -0.399  0.946   2.454   2.972   3.962   2.521
                1.206   0.811   -0.587  0.376   -0.408  0.929   2.616   3.052   4.303   2.699
                1.106   0.595   -0.769  0.422   -0.397  0.945   2.186   2.733   3.671   2.250
                1.062   0.799   -0.526  0.430   -0.427  0.945   2.164   2.611   3.677   2.152

F: Approach     1.099   0.462   -1.001  0.268   -0.231  0.947   2.175   3.167   2.661   2.241
                1.076   0.397   -0.954  0.266   -0.206  0.922   2.668   3.327   3.791   2.785
                1.028   0.336   -0.942  0.248   -0.186  0.922   2.366   3.173   3.272   2.490
                1.018   0.422   -1.018  0.331   -0.257  0.918   2.597   3.458   3.375   2.683

G: Roll         1.182   0.437   -0.395  0.095   -0.584  0.929   2.952   4.076   3.523   3.025
                1.204   0.621   -0.420  0.083   -0.663  0.942   3.257   4.185   4.077   3.394
                1.032   0.382   -0.353  0.053   -0.660  0.935   2.923   3.970   3.627   3.076
                1.087   0.528   -0.337  0.110   -0.725  0.943   2.788   3.782   3.597   2.906

TABLE 2
Classification Results

Feature        Correct          Percentage
combination    classification   success

All            28               100
b, d           28               100
a, d           24               85
b, c           21               75
d              21               75
b              20               71

of  the  test  cases.  In  fact,  the  second-order  features  alone 
are  sufficient  for  successful  classification  in  all  cases. 

We  also  performed  a  principal  component  analysis  of 
these  features  to  gauge  the  relative  importance  of  differ¬ 
ent  features  in  producing  the  variation  in  the  sample  val¬ 
ues.  The  first  three  principal  components  of  the  entire 
data set are shown in Table 3. Note that the first principal
component  has  a  high  eigenvalue  and  relatively  high  pro¬ 
portions  of  the  second-order  features,  particularly  posi¬ 
tive  and  negative  divergence.  This  is  consistent  with  the 
finding  that  the  second-order  features  alone  are  sufficient 
for  classification  in  this  case.  The  principal  components 


TABLE 3
Principal Components

Comp   Coefficients (features a, b, c, d)                                        Eigenvalue

1      0.29  0.95  0.02  -0.29  -0.20  0.15  [illegible]  0.16                   54.95
2      -0.25  0.38  0.78  0.07  0.10                                             5.82
3      -0.09  [illegible]  -0.50  0.12  -0.16  0.01  0.16  0.06  -0.00  0.13     2.54

FIG. 2. (A) Image and flow for paper bands. (B) Image and flow for cloth. (C) Image and flow for plant leaves. (D) Image and flow for water. (E)
Image and flow for turbulent motion.


within  each  sample  contain  small  absolute  coefficients  for 
the  same  second-order  features,  showing  that  these  fea¬ 
tures  are  most  useful  in  classification. 
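For readers unfamiliar with principal component analysis, the following sketch extracts the first principal component and its eigenvalue from a set of feature vectors by power iteration on the sample covariance matrix. The function names and toy data are our own; power iteration is one simple way to obtain the leading component and is not necessarily the procedure used to produce Table 3.

```python
import math


def covariance_matrix(samples):
    """Sample covariance matrix of a list of equal-length feature vectors."""
    n, dim = len(samples), len(samples[0])
    means = [sum(column) / n for column in zip(*samples)]
    centered = [[x - m for x, m in zip(row, means)] for row in samples]
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dim)] for i in range(dim)]


def first_principal_component(samples, iterations=200):
    """Leading eigenvector and eigenvalue of the covariance matrix."""
    cov = covariance_matrix(samples)
    dim = len(cov)
    vec = [1.0] * dim  # starting guess; must not be orthogonal to the answer
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the variance along the component.
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(dim))
                     for i in range(dim))
    return vec, eigenvalue


# Toy data lying exactly on the line y = 2x.
component, eigenvalue = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
print(component, eigenvalue)
```

Features with large coefficients in a component that has a large eigenvalue account for most of the variation across the data set, which is how Table 3 is read in the text.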

5.  CONCLUSION  AND  FUTURE  WORK 

We  have  described  a  method  of  motion  recognition 
using  temporal  textures.  This  technique  uses  statistical 
measures  of  local  motion  features  as  components  of  a 
feature  vector  that  can  be  used  in  standard  classification 
methods.  We  identified  several  motion  features  that  ap¬ 
pear  to  have  desirable  properties  for  recognition,  and  il¬ 


lustrated  their  utility  in  classifying  a  sample  of  real-world 
temporal  textures.  Future  work  includes  the  analysis  of 
other  feature  classes,  including  purely  temporal  features 
of  the  flow  as  well  as  Fourier  techniques. 

We  also  plan  to  extend  the  technique  to  the  recognition 
of  compact,  possibly  nonrigid,  objects.  This  differs  from 
textural  recognition  in  that  it  is  typically  the  detailed  ar¬ 
rangement  of  features  (in  space  and  time),  which  we  term 
the  object’s  action,  rather  than  regional  statistics,  that 
constitutes the basis for identification. Though such an
extension  will  not  provide  a  general  solution  to  the  object 
recognition problem, we think that there are a number of
situations in which motion recognition techniques can
make  identification  far  easier  than  it  would  be  using  static 
image  processing  alone.  For  example,  something  big  and 
oblong  moving  smoothly  and  horizontally  in  the  vicinity 
of  a  road  is  probably  a  car.  This  conjunction  of  features  is 
fairly  simple  to  compute,  particularly  compared  to  the 
requirements  of  static  analysis,  which  must  be  able  to  tell 
cars  from  boulders,  architectural  clutter,  and  shadows  on 
the  road.  Similarly,  the  toad  in  Section  1  assumes  that 
anything small (it has a notion of distance and hence of
size  from  crude  stereo),  oblong,  and  moving  in  the  direc¬ 
tion  of  its  long  axis  is  good  to  eat  (or  at  least  is  worth  a 
taste). 

The  simplest  technique  is  to  use  conjunctions  of  mo¬ 
tion  and  geometric  features,  and  more  generally,  spatio- 
temporal  templates  specifying  the  rough  spatial  arrange¬ 
ment  of  motion  (and  geometric)  features.  More 
sophisticated  pattern  recognition  techniques  include  the 
generalized Hough transform [Ball81] and hypothesize-
and-test schemes [Grim86]. A candidate for handling
time  sequences  for  which  a  fixed  template  is  insufficiently 
flexible  is  the  formalism  of  hidden  Markov  models 
[Baum70, Jeli76, Juan85]. These have been used primarily
for speech recognition, but the technique is valid for a
wide variety of processes describable by a sequence of
discrete symbols having an underlying probabilistic relationship.

Abstract

The recognition of repetitive movements characteristic
of walking people, galloping horses,
or flying birds is a routine function of the
human visual system. It has been demonstrated
that humans can recognize such activity
solely on the basis of motion information.
We present a novel computational approach
for detecting such activities in real image
sequences on the basis of the periodic nature
of their signatures. The approach suggests
a low-level feature based activity recognition
mechanism. This contrasts with earlier
model-based approaches for recognizing such
activities.


1  Introduction 

The motion recognition ability of the human visual system
is remarkable. People are able to distinguish both
highly structured motion, such as that produced by
walking, running, swimming or flying birds, and more
statistical patterns such as that due to blowing snow,
flowing water or fluttering leaves. The classic demonstration
of pure motion recognition by humans is provided
by Moving Light Display experiments [Johansson,
1973]. More subtle movement characteristics can be
distinguished as well. For example, human observers can
identify the actor's gender and even identify the actor
if known to them by his or her gait. Similar discrimination
abilities using motion alone have been observed
in non-human animals as well [Ewart, 1987]. This biological
use of motion probably reflects the fact that for
certain tasks, visual motion provides more effective cues
than other modes of visual perception. Motion is a
particularly useful cue for certain types of recognition due
to the fact that it is relatively easy to extract the motion
field independent of illumination and shading of the
image.

As a first step towards motion recognition by a machine,
we define three classes of motion according to the
spatial and temporal uniformity exhibited, so that different
motions can be recognized using different techniques
appropriate to their inherently different characteristics.
We define temporal textures to be the motion patterns
of indeterminate spatial and temporal extent, activities
to be motion patterns which are temporally periodic but
are limited in spatial extent, and motion events to be
isolated simple motions that do not exhibit any temporal
or spatial repetition. Examples of temporal textures
include wind blown trees or grass, turbulent flow in cloud
patterns, ripples on water, the motion of a flock of birds
etc. Examples of activities are walking, running, rotating
or reciprocating machinery, etc. Examples of motion
events are isolated instances of opening a door, starting
of a car, throwing a ball etc.

It turns out that temporal textures can be effectively
treated with statistical techniques analogous to those
used in gray-level texture discrimination. A previous
paper [Polana and Nelson, 1992] describes this. Activities
and motion events, on the other hand, are more
discretely structured, and techniques similar to those
used in static object recognition would be expected to
be useful in their classification. Since different sorts of
techniques must be used to distinguish the different sorts
of motion, it would be useful to have a method for making
a preliminary classification of the motions present in
an image. In this paper, we describe a robust method
for detecting and localizing periodic activities, including
ones, such as walking or flying, that involve simultaneous
translation of the actor. The method is based on
frequency domain analysis of an image in which low-level
motion information has been used to isolate and track
likely locations of activity. The method also suggests
a way of using low-level structural features to classify
activities once they have been detected.
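The idea of detecting periodicity by frequency domain analysis can be illustrated on a one-dimensional signal, for example a motion measurement taken at a tracked location over time. The sketch below is a generic illustration with hypothetical function names; it computes a discrete Fourier power spectrum directly and reports the dominant period together with the fraction of power at that frequency. It is not the detection method developed in this paper.

```python
import cmath
import math


def power_spectrum(signal):
    """Power at each nonzero frequency index k = 1 .. n/2 of a real signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the constant component
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append((k, abs(coeff) ** 2))
    return spectrum


def dominant_period(signal):
    """Return (period in samples, fraction of power at the peak frequency)."""
    spectrum = power_spectrum(signal)
    total = sum(power for _, power in spectrum)
    if total == 0:
        return None, 0.0  # constant signal: no periodicity
    k, peak = max(spectrum, key=lambda item: item[1])
    return len(signal) / k, peak / total


# A sinusoid with a period of 8 samples observed for 64 samples.
signal = [math.sin(2 * math.pi * t / 8) for t in range(64)]
print(dominant_period(signal))
```

A power fraction near one indicates a strongly periodic signal, while an aperiodic signal spreads its power over many frequencies and yields a small fraction.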

Motion recognition techniques, both of the discrete
and textural variety, have the potential to disambiguate
the motions of different origin. The motions of many
natural objects can be classified as periodic activities,
including human walking. Duplication of the recognition
ability of these motions in machine systems would
be useful in a number of applications, such as automated
surveillance. Motion detection via image differencing can
be used for intruder detection; however, such systems are
subject to false alarms, especially in outdoor environments,
since the system is triggered by anything that
moves, whether it be a human, a dog, or a tree blown
by the wind. Another application is in industrial monitoring.
Many manufacturing operations involve a long
sequence of simple operations each performed repeatedly
and at high speed by a specialized mechanism at a
particular location. It should be possible to set up one or
more fixed cameras that cover the area of interest, and
to characterize the allowed motions in each region of the
image(s).
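The weakness of image differencing noted above is easy to see in a sketch. The detector below, with hypothetical function names and threshold values of our own choosing, raises an alarm whenever enough pixels change, without regard to what caused the change.

```python
def changed_pixels(frame0, frame1, threshold):
    """Coordinates of pixels whose gray level changed by more than threshold."""
    return [
        (r, c)
        for r, (row0, row1) in enumerate(zip(frame0, frame1))
        for c, (a, b) in enumerate(zip(row0, row1))
        if abs(a - b) > threshold
    ]


def motion_alarm(frame0, frame1, threshold=10, min_pixels=3):
    """True if at least min_pixels pixels changed between the two frames."""
    return len(changed_pixels(frame0, frame1, threshold)) >= min_pixels


background = [[0] * 4 for _ in range(4)]
# A small bright blob appears: it could be a person, a dog, or a branch.
blob = [row[:] for row in background]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    blob[r][c] = 50
print(motion_alarm(background, blob))
```

Because the decision depends only on the number of changed pixels, any moving object of sufficient size triggers it, which is the source of the false alarms described in the text.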

2  Related  Work 

Although motion plays an important role in biological
recognition tasks, motion recognition in general has
received little attention in the literature compared to the
volume of work on static object recognition. Most
computational work in motion, in fact, has been concerned
with various aspects of the structure-from-motion
problem. There is a large body of psychophysical literature
addressing the perception of motion, most of it
concerned with primitive percepts. A modest amount
of this work addresses more complicated motion recognition
issues [Johansson, 1973, Cutting, 1981, Hoffman
and Flinchbaugh, 1982, Hildreth and Koch, 1987], but
the models and descriptions have typically not been
implemented. Various computational models of temporal
structure have been proposed (e.g. [Chun, 1986,
Feldman, 19^]) but much of this work is at a fairly high
level of abstraction, and has not actually been applied
to visual motion recognition except in rather artificial
tests.

Goddard [1989] considers recognizing event sequences
from Moving Light Display (MLD) images. His work
addresses the representation of motion event sequences
and their recognition assuming certain invariant image
features. His input consists of the joint angles
and angular velocities computed from the motion of
the dots in the light displays. The joint angles and
angular velocities are invariant to rotation in the image
plane, scale and translation. A challenging part
in computing these invariants is to recover the connectivity
of the individual dots (by body parts) in the
MLD images. A domain independent approach to this
problem is given by Rashid. Rashid [Rashid, 1980,
O'Rourke and Badler, 1980] considered the computational
interpretation of moving light displays, particularly
in the context of gait determination. This work
emphasized rather high-level symbolic models of temporal
sequences, an approach made possible by the discrete nature
of the moving light displays. The results were quite
sensitive to discrete errors and thus highly dependent
on the ability to solve the correspondence problem and
accurately track joint and limb positions. This severely
limits the general applicability of the method.

A few studies have considered highly specific aspects
of motion recognition computationally. Pentland
[Pentland and Mase, 1989] considered lip reading, and
implemented a system that could recognize spoken digits with
70%-90% accuracy over 5 speakers. The system required
the location of the lips to be entered by hand, and
depended on an explicitly constructed lip model. Some
temporal pattern recognition work has been done in the
context of speech processing [Juang and Rabiner, 1985,
Tank and Hopfield, 1987, Elman, 1988], but the
applicability of these techniques to motion recognition has not
been considered.

Anderson et al. [Anderson et al., 1985] describe a
method of change detection for surveillance applications
based on the spectral energy in a temporal difference
image. This was not generalized to other motion
features or more sophisticated recognition. Koller, Heinze
and Nagel [1991] developed a system that tracks moving
vehicles and characterizes their trajectory segments
in terms of natural language concepts. Gould and Shah
[1989] represent motion characteristics of moving objects
by recording the important events in their trajectory.
They propose the use of the resulting trajectory primal
sketch in a motion recognition system. Allmen and Dyer
have developed a method of extracting spatiotemporal
curves corresponding to moving objects and applied the
technique to detection of cyclic motions [Allmen and
Dyer, 1990]. All the above require the difficult task of
robustly computing the trajectories or spatiotemporal
curves from image sequences before attempting
recognition, and the demonstrations of their techniques involve
only synthetic image sequences.

3  Activity  Detection 

Activities involve a regularly repeating sequence of motion
events. If we consider an image sequence as a
spatiotemporal solid with two spatial dimensions x, y and
one time dimension t, then repeated activity tends to
give rise to periodic or semi-periodic gray level signals
along smooth curves in the image solid. We refer to
these curves as reference curves. If these curves could be
identified and samples extracted along them over several
cycles, then frequency domain techniques could be used
to judge the degree of periodicity.

Before defining the reference curves, we first formalize
the concept of a periodic object. An object is
defined as a set of points $P$. Associated with each $p \in P$
is a function $X_p(t)$ giving its location (in a fixed 3D
coordinate system) as a function of time. A stationary
periodic object (i.e. a stationary object exhibiting periodic
activity) has the property that $X_p(t) = X_p(t + \tau)$
for all $p \in P$, where $\tau$ is the time period for one cycle
of the activity and is independent of $p$. We now define
a translating periodic object. Such an object has the
property that $X_p(t) = Y_p(t) + Z(t)$, where $Y_p$ satisfies
$Y_p(t) = Y_p(t + \tau)$ and $Z(t)$ is a path in 3D space
independent of $p$. It can be assumed that $Z(0) = 0$, so that
$X_p(0) = Y_p(0)$. Intuitively, a periodic object characterized
by $Y_p(t)$ is translated along the path $Z(t)$ (we are
assuming the object does not undergo any rotation and
the viewing angle does not change). If we compensate
for the translation of the object, we would be looking at
a stationary periodic object, as shown by the equation
$X_p(t) - Z(t) = Y_p(t) = Y_p(t + \tau) = X_p(t + \tau) - Z(t + \tau)$.
Note that $Z(t)$ is not necessarily periodic. Note also that
a stationary periodic object is a special case of a translating
periodic object with no translation, or in other words
$Z(t) = 0$ for all $t$.

Corresponding to each point $p$ of a translating periodic
object, we define a 3D-reference curve $R_p(t)$ to be
the path $X_p(0) + Z(t)$. We also define a 2D-reference
curve $r_p(t)$ corresponding to a point $p$ of the object, to
be the projection of $R_p(t)$ onto the image plane over time
(hence $r_p(t)$ is a curve in $(x, y, t)$ space). The gray-level
signal along the 2D-reference curve $r_p(t)$ is determined
by the set of points of the object that appear along the
3D-reference curve $R_p(t)$. It can be shown that the same
set of points of the object recur periodically along each
reference curve $R_p(t)$. For example, the point $p$ is on
the reference curve $R_p(t)$ at time zero, and it coincides
with the reference curve at regular intervals of $\tau$ (since
$X_p(\tau) = Y_p(\tau) + Z(\tau) = Y_p(0) + Z(\tau) = X_p(0) + Z(\tau)$).
Similarly, every other point of the object on the reference
curve $R_p(t)$ recurs along $R_p(t)$ at intervals of $\tau$.
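
The recurrence property can be checked numerically. The sketch below is
illustrative only: the periodic part Y, the translation path Z, the
period TAU and the test point are made-up choices, not quantities taken
from the experiments.

```python
import numpy as np

TAU = 2.0  # assumed period of the activity (arbitrary time units)

def Y(p, t):
    """Periodic part of the motion of point p (hypothetical example)."""
    phase = 2.0 * np.pi * t / TAU
    offset = np.array([np.sin(phase), 1.0 - np.cos(phase), 0.0])
    return np.asarray(p, dtype=float) + offset

def Z(t):
    """Non-periodic translation path shared by all points, with Z(0) = 0."""
    return np.array([0.5 * t, 0.1 * t ** 2, 0.0])

def X(p, t):
    """Location of point p at time t: X_p(t) = Y_p(t) + Z(t)."""
    return Y(p, t) + Z(t)

def R(p, t):
    """3D reference curve through p: R_p(t) = X_p(0) + Z(t)."""
    return X(p, 0.0) + Z(t)

p = (1.0, 2.0, 3.0)
# At whole multiples of the period the point lies on its reference curve.
on_curve = [bool(np.allclose(X(p, k * TAU), R(p, k * TAU))) for k in range(5)]
# Half a period later it is displaced from the curve.
off_curve = bool(np.allclose(X(p, 0.5 * TAU), R(p, 0.5 * TAU)))
```

Here Z is deliberately non-periodic (it contains a quadratic term), which
matches the remark that only Y_p needs to be periodic.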


Figure 1: Stationary circular rotation: temporal frequency and phase

We shall illustrate the concept with two examples, one a
stationary activity (one produced by a stationary periodic
object) and the other involving a uniform translation
of the actor, i.e. a locomotory activity. If the activity
is stationary, the reference curves are lines parallel to
the temporal dimension. For example, a circularly
rotating ring gives rise to a temporally periodic signal at
every pixel. This is illustrated in figure 1. In the case of
uniform translation, the curves are straight lines at some
angle that depends on the velocity. For general translation
and perspective projection, the lines associated
with a given actor approaching the camera form a bundle
with a common intersection, the vanishing point. For
many practical situations, however, the vanishing point
is far enough removed that the lines can be considered
to be effectively parallel.

Consider the case of human walking. This is an example
of a non-stationary activity; that is, if we attach
a reference point to the person walking, that point does
not remain at one location in the image. If the person
is walking with constant velocity, however, and is
not too close to the camera, then the reference point
moves across the image on a path composed of a constant
velocity component modulated by whatever periodic
motion the reference point undergoes. Thus, if we
know the average velocity of the person over several cycles,
we can compute the spatiotemporal line of motion
along which the periodicity can be observed. If the person
moves with average velocity $(u, v)$, the spatiotemporal
line of motion will be determined by the equations
$(x, y) = (u, v)\,t + (x_0, y_0)$, where $(x, y)$ is the position
of the object in space at time $t$ and $(x_0, y_0)$ is the position
at time zero. This applies to any object undergoing
constant velocity locomotion.
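
As a concrete illustration, the gray-level signal along one such
spatiotemporal line can be read out of an image sequence as follows.
This is a minimal sketch, not the original implementation: the function
name, the (T, H, W) array layout, nearest-pixel rounding and the use of
NaN for samples that leave the image are all assumptions made here.

```python
import numpy as np

def sample_along_line(frames, x0, y0, u, v):
    """Sample gray levels along (x, y) = (u, v) * t + (x0, y0).

    frames: array of shape (T, H, W); x indexes columns, y indexes rows.
    Positions are rounded to the nearest pixel; samples that fall
    outside the image are returned as NaN.
    """
    frames = np.asarray(frames, dtype=float)
    n_frames, height, width = frames.shape
    t = np.arange(n_frames)
    xs = np.rint(x0 + u * t).astype(int)
    ys = np.rint(y0 + v * t).astype(int)
    inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
    signal = np.full(n_frames, np.nan)
    signal[inside] = frames[t[inside], ys[inside], xs[inside]]
    return signal
```

With u = v = 0 the function reduces to the stationary case, in which the
reference curve is a line parallel to the temporal axis through a single
pixel.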

3.1  Periodicity  Detection 

From Fourier theory we know that any periodic signal
can be decomposed into a fundamental and harmonics.
That is, we can consider the energy of a periodic signal to
be concentrated at frequencies which are integral multiples
of some fundamental frequency. This implies that if
we compute the discrete Fourier transform of a sampled
periodic signal, we will observe peaks at the fundamental
frequency and its harmonics. Hence, in theory, the
periodicity of a signal can be detected by obtaining its
Fourier transform and checking whether all the energy
in the spectrum is contained in a fundamental frequency
and its integral multiples.

Real-world signals, however, are seldom perfectly
periodic. In the case of signals arising from activity
in image sequences, disturbances can arise from errors
in the uniform translation assumption, varying background
and lighting behind a locomoting actor, and
other sources. In addition, for computational purposes,
we need to truncate the signal at some finite length, which
may not be an exact integral multiple of its period.
Nevertheless, the frequency defined by the highest amplitude
often represents the fundamental frequency of the signal.
Hence we can get an idea of the periodicity in a signal by
summing the energy at the highest amplitude frequency
and its multiples, and comparing that quantity to the
energy at the remaining frequencies. In practice, since
peaks in a Fourier transform tend to be slightly broadened
for a variety of reasons, including the finite length of
the sample, we define the periodicity measure $p_f$ of a
signal $f$ as a normalized difference of the sum of the power
spectrum values at the highest amplitude frequency and
its multiples, and the sum of the power spectrum values
at the frequencies halfway between. That is,

$$p_f = \frac{\sum_i F_{iw} - \sum_i F_{iw + w/2}}{\sum_i F_{iw} + \sum_i F_{iw + w/2}}$$

where $F$ is the energy spectrum of the signal $f$ and $w$ is
the frequency corresponding to the highest amplitude in
the energy spectrum.

The  measure  is  normalized  with  respect  to  the  total 
energy  at  the  frequencies  of  interest  so  that  it  is  one  for  a 
completely  periodic  signal  and  zero  for  a  flat  spectrum. 
In  general,  if  a  signal  consists  of  frequencies  other  than 
one  single  fundamental  and  its  multiples,  its  periodicity 
measure  will  be  low. 
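
A minimal sketch of this measure for a single sampled signal is given
below. It is not the original code. Several details are assumptions:
the mean is removed so that the DC term cannot be chosen as the peak,
the harmonics are taken as the bins k, 2k, 3k, ... up to the Nyquist
bin, the halfway frequencies are taken as k/2, 3k/2, ..., and for odd k
the half-integer bins are linearly interpolated from their neighbours.

```python
import numpy as np

def periodicity_measure(signal):
    """Return (p, k): periodicity measure and peak frequency bin.

    p = (S_h - S_m) / (S_h + S_m), where S_h sums the power spectrum at
    the peak bin k and its multiples, and S_m sums it halfway between.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                         # drop the DC component
    power = np.abs(np.fft.rfft(x)) ** 2      # one-sided power spectrum
    if power.size < 3 or not power[1:].any():
        return 0.0, 0                        # constant or too-short signal
    k = int(np.argmax(power[1:])) + 1        # highest-amplitude frequency
    harmonics = np.arange(k, power.size, k)  # k, 2k, 3k, ...
    halfway = harmonics - k / 2.0            # k/2, 3k/2, 5k/2, ...
    bins = np.arange(power.size)
    s_h = power[harmonics].sum()
    # Assumption: half-integer bins are interpolated from their neighbours.
    s_m = np.interp(halfway, bins, power).sum()
    return float((s_h - s_m) / (s_h + s_m)), k
```

For a sinusoid with a whole number of cycles in the window the measure
is close to one. Adding a second strong component at a frequency that
is not a multiple of the fundamental pulls it down sharply, and an
isolated impulse, whose spectrum is flat, gives a value close to zero.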

Because the signal along any given reference curve in
the image solid may be ambiguous, we need a way of
combining periodicity measures of a number of signals
from reference curves associated with the same actor.
The simplest idea would be to sum the power
spectra of the various signals, and apply the periodicity
measure to the resultant curve. Unfortunately, this
does not work, primarily because, although there is a
fair amount of energy at the fundamental frequency, and
quite a few signals in which high periodicity is present,
there are also a lot of samples where the periodicity is
not evident, or which appear periodic at some other
frequency. The net effect is that all this energy at other
frequencies can swamp the main signal if they are combined
additively. What does work is a form of non-maximum
suppression, where the periodicity measure is obtained
for each power spectrum separately. Each frequency $w$
is then assigned a value $P_w$ equal to the sum of the
periodicity measures of all the signals whose highest
amplitude occurred at that frequency. The result is the
same as suppressing all but the maximum frequency in
each transform, weighting each by the periodicity measure
of the signal, and summing them. The maximum
value of this combined signal is taken as the fundamental
frequency, and the associated periodicity measure is the
average of the periodicity measures of the contributing
signals.

Thus, the periodicity measure $P$ for an entire image
sequence is defined as

$$P = \max_w \frac{P_w}{n_w}$$

where $n_w$ and $P_w$ are the number of pixels at which
the highest amplitude frequency is $w$ and the sum of
periodicity measures at those pixels, respectively.
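
The combination step can be sketched as follows. The code follows the
prose description above: the frequency with the largest summed measure
is selected, and the average measure of the signals that voted for it
is reported. How the selection and the averaging are ordered is an
assumption, as are the function name and the convention that
frequencies are given as non-negative integer bin indices.

```python
import numpy as np

def sequence_periodicity(peak_bins, measures):
    """Combine per-signal results into (P, w) for the whole sequence.

    peak_bins[i] is the highest-amplitude frequency bin of signal i and
    measures[i] is its periodicity measure.
    """
    peak_bins = np.asarray(peak_bins, dtype=int)
    measures = np.asarray(measures, dtype=float)
    if peak_bins.size == 0:
        return 0.0, 0                               # nothing to combine
    n_w = np.bincount(peak_bins)                    # signals voting for each w
    P_w = np.bincount(peak_bins, weights=measures)  # summed measures per w
    w_star = int(np.argmax(P_w))                    # fundamental frequency
    return float(P_w[w_star] / n_w[w_star]), w_star
```

Because each signal contributes only at its own peak frequency, a single
signal with a high measure at an unrelated frequency does not displace a
frequency supported by many signals.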

Finally, in order to apply the technique to real data,
we need a way of extracting reference curves and the
associated signals from an image sequence. In the
following, we assumed that any activity that existed in the
data would be either stationary, or locomotory in a manner
that produced an overall translating motion. We also
assumed that there was at most one actor in the scene,
though a certain amount of background motion could be
tolerated. A third assumption is that the viewing angle
and the scene illumination do not change significantly,
so that the intensity along the reference curves remains
periodic. The first assumption turns out not to be too
restrictive: a large number of natural periodic activities
fit into one of the two categories. The second can
be relaxed with some additional preprocessing. Refer to
the discussion section for how this can be achieved and
how the other assumptions can be relaxed as well.

The first step of the algorithm is to identify locations
in the scene where movement of any sort is occurring.
This is done by computing the normal flow magnitude
at each pixel between each successive pair of frames using
a spatiotemporal differential method. Those pixels at
which significant motion is present are marked, and the
centroid of the marked pixels is computed in each frame.
The mean velocity (if any) of the actor is then computed
by fitting a linear trajectory to the sequence of
centroids. This is where the one-actor assumption comes
into play. If several actors were present, simple clustering
techniques could be used to isolate the regions in
the scene corresponding to different activities. The
reference curves were taken as the lines in the spatiotemporal
solid parallel to that generated by the linear-fitted
trajectory of the centroid. Signals were extracted along
these curves, and those that displayed significant spread
over a period of at least half as long as the signal were
selected for processing. This had the effect of eliminating
the need to process regions in which no motion occurred,
as well as regions affected only by an occasional
blip. The periodicity measures for all signals extracted
are computed and used in computing the periodicity measure
P for the entire image sequence as described above.
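
The motion-detection and velocity-estimation stage can be sketched as
below. This is a simplified stand-in, not the original implementation:
normal flow magnitude is approximated as the temporal difference
divided by the spatial gradient magnitude of the two-frame average,
which is one common differential formulation. The half-pixel-per-frame
threshold is the value quoted in the discussion section; the gradient
floor `min_grad` and both function names are assumptions.

```python
import numpy as np

def normal_flow_magnitude(prev, curr, min_grad=1.0):
    """Approximate normal flow magnitude |I_t| / |grad I| between two frames.

    Pixels whose spatial gradient magnitude does not exceed min_grad
    (gray levels per pixel) are given zero flow.
    """
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    gy, gx = np.gradient(0.5 * (prev + curr))  # derivatives along rows, columns
    grad = np.hypot(gx, gy)
    flow = np.zeros_like(grad)
    np.divide(np.abs(curr - prev), grad, out=flow, where=grad > min_grad)
    return flow

def actor_velocity(frames, flow_threshold=0.5):
    """Fit a linear trajectory to the per-frame centroids of motion.

    Returns ((u, v), (x0, y0)): mean velocity in pixels per frame and
    the fitted centroid position at time zero.
    """
    times, cx, cy = [], [], []
    for t in range(1, len(frames)):
        moving = normal_flow_magnitude(frames[t - 1], frames[t]) > flow_threshold
        if moving.any():
            ys, xs = np.nonzero(moving)
            times.append(t - 0.5)  # the flow lies between frames t-1 and t
            cx.append(xs.mean())
            cy.append(ys.mean())
    if len(times) < 2:
        return (0.0, 0.0), (0.0, 0.0)  # no usable motion: treat as stationary
    u, x0 = np.polyfit(times, cx, 1)
    v, y0 = np.polyfit(times, cy, 1)
    return (float(u), float(v)), (float(x0), float(y0))
```

The fitted (u, v) gives the direction of the reference lines in the
spatiotemporal solid; a sequence with no detected motion yields zero
velocity, so the reference curves fall back to lines parallel to the
temporal axis.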

3.2  Experiments 

We ran experiments on four different activities, and a
number of non-periodic motions. The sequences were
first recorded on video and then digitized later with suitable
temporal sampling so that at least four cycles of the
activity were captured in 128 frames. Following is a
description of each activity and the conditions under which
they were digitized.

• Walk: A person walking across a room viewed in
profile. Six sequences of 128 frames of size 128x128
pixels were obtained. Half the sequences contained
one person and the other half a second.

• Exercise: A person performing jumping jacks. Four
sequences of 128 frames of 128x128 pixels, two each
of two different people.

• Swing: A person swinging viewed from the side.
Six sequences of 128 frames of 128x128 pixels, three
each of two different people.

• Frog: A toy frog simulating swimming activity
viewed from above. Four sequences of 128 frames
of 64x256 pixels.

• Nonperiodic: Various sequences taken from television
shows and live outdoor shots: splashing water,
closeup of crowd at a political rally, a plane
flying overhead, a robot hand picking up and
manipulating objects (2 sequences), the input to an
eye tracker (eyeball movements), leaves fluttering
in the wind, turbulent flow in a stream. In all, 8
sequences of 128 frames of 128x128 pixels.

The swing and exercise activities were shot outdoors and
contained background motion as well. Among the periodic
activities, a single sequence of uniform rotation is
included as well. Sample images of these activities are
shown in figures 2 and 3.

The periodicity measures computed using the above
algorithm are plotted for all 20 periodic and all 8
non-periodic sequences in figure 4. As is evident from the
graphs and the projected scatter plot, the technique
separates complex periodic from non-periodic motion nicely.
The requirement that an empirically determined threshold
be used is not a great drawback in this case, nor
is it particularly surprising, since even the intuitive
notion of periodic activity falls on a continuum. Is the
motion of a branch waving somewhat irregularly in the
wind periodic or non-periodic? Here, we classified it as
non-periodic, but it had one of the higher periodicity
measures, as might be expected.


4  Discussion 

Our  periodic  activity  detection  algorithm  can  be  sum¬ 
marized  as  follows: 

•  Input:  The  input  to  the  algorithm  is  a  digitized  im¬ 
age  sequence  consisting  of  128  frames  of  resolution 
128x128  pixels. 

• Output: A periodicity measure indicating the
amount of periodicity observed in the image sequence.
This is used to decide whether the image
sequence contains a periodic activity and if so, to
locate the region of the activity.

• Step 1. Compute normal flow magnitude at each
pixel between each successive pair of frames using
the differential method.

• Step 2. Mark pixels corresponding to significant
motion in the scene by thresholding the normal flow
magnitude. Compute the centroid of the marked pixels
in each frame. Compute the mean velocity (if any)
of the actor by fitting a linear trajectory to the
sequence of centroids. Take reference curves to be
the lines in the spatiotemporal solid parallel to the
linear trajectory of centroids of motion.

• Step 3. Extract gray-level signals along the reference
curves. Compute the dominant frequency $w$
and the periodicity measure $p_f$ for each individual
signal extracted.

• Step 4. Compute the overall periodicity measure $P$ for
the image sequence using the formula given in the last
section.

We have assumed a number of things for the method
to work correctly. First, we assumed that there is only
one actor in the scene in approximately constant linear
locomotion, and that the motion of the actor is significantly
higher than that of the background. We
also assumed that the viewing angle and scene illumination
do not change significantly. Further, it is assumed
that the entire image sequence consists of at least four
cycles of the periodic activity if there is any. The following
is a discussion of some of the merits of the algorithm
and some approaches to dealing with cases where the
assumptions are violated.

The method we described satisfies several desirable
invariances. It is invariant to image illumination,
contrast, translation, rotation and scale. It is also
invariant to the magnitude of locomotory motion and the
speed of the activity. It is also fairly robust with respect
to small changes in viewing angle. The periodicity measure
does not depend on the number of pixels involved
in the activity. If desired, a restriction on the minimum
number of pixels can be imposed so that only activities
of a minimum size can be recognized. The swing and
exercise sequences were taken outdoors where there is
a small amount of background motion. This comprises
not only moving trees and plants, but also moving people
and the occasional crossing of a car. The thresholding
stage on motion magnitude in step 2 of the algorithm
(in our implementation one-half pixel per frame is used)
eliminates small background motion, but it cannot eliminate
larger background motion such as that produced by a
car passing. That periodicity can be detected even in
this case demonstrates that the technique is reasonably
tolerant of background clutter and an occasional disturbance.
The technique also provides a method for localizing
activity in the scene by back-projecting the reference
curves having high periodicity measures into the image
solid.

So far we have assumed that the actors giving rise
to the activity move with constant velocity along linear
paths. The case of nonlinearly moving objects can
be handled by tracking the object of interest given a
coarse estimate of its initial location and velocity. This
would generate reference curves that were not straight
lines. We have already demonstrated the usefulness of
the centroid of motion for computing the velocity of linearly
moving objects. It could also be used for tracking
actors moving on more complex trajectories. Use of
the motion centroid can be unreliable in estimating the
centroid of the object if the shape of the object changes
as it moves. In this case, use of a prediction and correction
mechanism using past values over a sufficiently long
period can help.

The detection scheme also assumes that there is only
one activity in the scene except for some background
clutter. If there are multiple activities in the scene,
this detection technique can still be applied provided
the activities can be spatially isolated so that they do
not interfere with each other. In this case they can be
segmented using the motion information and later tracked
separately. Even an occasional crossing of different
activities can be tolerated as long as the regions can be
separated again later. In our experiments, the periodic
activity samples consist of at least four cycles of the
activity. A minimum of four cycles was used to detect the
actual frequency, given that there is a considerable
element of non-repetitive structure from the background
in the case of translating actors.

The complexity of detection is proportional to the
number of pixels involved in the activity. About half the
work is computing the fast Fourier transforms at each of
the pixels. Most of the rest of the time is occupied by
the motion detection process. The detection procedure
currently runs on an SGI machine using four processors,
and it takes approximately 15 seconds to process a 128
frame sequence of 128x128 images.

4.1  Recognition  of  Activities 

The first stage in recognizing an activity is to detect that
an activity exists, and localize it in the scene. This paper
has described a technique for accomplishing this. Future
work will utilize information computed in the detection
stage for recognition and classification of specific activities.
The detection scheme utilizes only the magnitude of
the Fourier transform to obtain the periodicity measure.
The phase of the Fourier transform is also computed at
each location in the image, and we propose to use this
information, along with other low-level information in the
image, for recognition. For example, walking can be
described as a sequence of motion events regularly occurring
at each spatial location. The cycles of motion events
at different spatial locations in the image have a fixed
phase difference. These phase differences are valuable in
characterizing the activities.
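
As an illustration of the quantity involved, the relative phase of two
signals at a shared fundamental frequency can be read from their
Fourier coefficients. This is a sketch only: the function names, the
use of a frequency bin index k, and the wrapping convention are
assumptions, and no particular phase feature is implied for the
proposed recognition stage.

```python
import numpy as np

def phase_at_bin(signal, k):
    """Phase (radians) of the DFT coefficient of the signal at bin k."""
    x = np.asarray(signal, dtype=float)
    return float(np.angle(np.fft.rfft(x - x.mean())[k]))

def phase_difference(signal_a, signal_b, k):
    """Phase of signal_b relative to signal_a at bin k, wrapped to (-pi, pi]."""
    d = phase_at_bin(signal_b, k) - phase_at_bin(signal_a, k)
    return float(np.angle(np.exp(1j * d)))
```

With this convention a negative result means that the second signal
lags the first, so a signal delayed by a quarter cycle gives a
difference of about -pi/2.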

5  Conclusion 

We have described a method of activity detection. This
technique uses a periodicity measure on gray-level signals
extracted along spatiotemporal reference curves. We
have illustrated the technique using real-world examples
of activities, and shown that it robustly detects complex
periodic activities, while excluding non-periodic motion.
We proposed a technique to recognize these activities using
the detection scheme described here. It is not clear
how useful the periodicity alone is for recognition,
but we believe the phase information is valuable for
activity recognition. Future work will concentrate on the
development of robust phase features that can be used in
conjunction with previously developed motion and
gray-level features to classify activities.

Figure 2: Sample images from periodic sequences: walk, exercise, swing and rotation

Figure 4: Periodicity measure for Periodic and Nonperiodic sequences

Abstract

The recognition of repetitive movements characteristic of walking people, galloping horses, or flying birds
is a routine function of the human visual system. It has been demonstrated that humans can recognize such
activity solely on the basis of motion information. We demonstrate a general computational method for
recognizing such movements in real image sequences using what is essentially template matching in a motion
feature space coupled with a technique for detecting and normalizing periodic activities. This contrasts with
earlier model-based approaches for recognizing such activities.

1  Introduction


The motion recognition ability of the human visual system is remarkable. People are able to recognize
both highly structured motion, such as that produced by walking, running, swimming or flying animals and
birds, and more statistical patterns such as those due to blowing snow, flowing water or fluttering leaves.
The classic demonstration of pure motion recognition by humans is provided by Moving Light Display
experiments [Johansson, 1973], where human subjects were able to distinguish activities such as walking,
running or stair climbing, from lights attached to the joints of an actor. More subtle movement characteristics
can be distinguished as well. For example, human observers can identify the actor’s gender, and even identify
the actor if known to them, by his or her gait. Similar discrimination abilities using motion alone have been
observed in non-human animals as well [Ewart, 1987]. This biological use of motion probably reflects the
fact that for certain tasks, visual motion provides more effective cues than other modes of visual perception.
Motion is a particularly useful cue for certain types of recognition because it is relatively easy
to extract the motion field independent of illumination and shading of the image.

As a first step towards motion recognition by a machine, we define three common classes of visual motion
on the basis of the spatial and temporal regularity of the signal. Different recognition techniques apply to
the different classes. We define the first class, temporal textures, to be motion patterns that exhibit statistical
regularity but have indeterminate spatial and temporal extent. Examples of temporal textures include
wind-blown trees or grass, turbulent flow in cloud patterns, ripples on water, the motion of a flock of birds, etc.
The second class, activities, consists of motion patterns that are temporally periodic and possess compact
spatial structure. Examples of activities include walking, running, rotating or reciprocating machinery, etc.
A third class, motion events, consists of isolated simple motions that do not exhibit any temporal or spatial
repetition. Examples of motion events are isolated instances of opening a door, starting a car, throwing
a ball, etc. One can imagine other combinations of attributes, e.g. spatially periodic and temporally limited,
but these do not seem to occur broadly in natural visual environments.

It turns out that temporal textures can be effectively treated with statistical techniques analogous to
those used in gray-level texture discrimination. A previous paper [Polana and Nelson, 1992] describes this.
Activities and motion events, on the other hand, are more discretely structured, and techniques similar to
those used in static object recognition would be expected to be useful in their classification.

In this paper, we describe a robust method for recognizing activities, including ones, such as walking, that
involve simultaneous translation of the actor. In an earlier paper [Polana and Nelson, 1993], we described
an algorithm to detect periodic activities in an image sequence making use of the periodic nature of the
activity. The recognition algorithm utilizes the periodic activity detection algorithm as a first step in the
computation of a normalized feature vector, which is then used to classify detected activity as one of several
known activities.




Motion recognition algorithms, both for temporal texture and activity, have potential applications in
several areas. One area is automated surveillance. Motion detection via image differencing can be used for
intruder detection; however, such systems are subject to false alarms, especially in outdoor environments,
since the system is triggered by anything that moves, whether it is a person, a dog, or a tree blown by the
wind. Motion recognition techniques can be used to disambiguate such situations. Another application is in
industrial monitoring. Many manufacturing operations involve a long sequence of simple operations, each
performed repeatedly and at high speed by a specialized mechanism at a particular location. It should be
possible to set up one or more fixed cameras that cover the area of interest, and to characterize the allowed
motions in each region of the image(s).

2  Related  Work 

Although motion plays an important role in biological recognition tasks, motion recognition in general has
received little attention in the literature compared to the volume of work on static object recognition. Most
computational work on motion has, in fact, been concerned with various aspects of the structure-from-motion
problem. There is a large body of psychophysical literature addressing the perception of motion,
most of it concerned with primitive percepts. A moderate amount of this work addresses more complicated
motion recognition issues [Johansson, 1973, Cutting, 1981, Hoffman and Flinchbaugh, 1982, Hildreth and
Koch, 1987], but the models and descriptions have typically not been implemented. Various computational
models of temporal structure have been proposed (e.g. [Chun, 1986, Feldman, 1988]), but much of this work
is at a fairly high level of abstraction, and has not actually been applied to visual motion recognition except
in rather artificial tests.

A specialized area that has seen some attention is the interpretation of moving light displays (MLDs). Goddard
[1989] considers recognizing event sequences from such images. His work addresses the representation of
motion event sequences and their recognition assuming certain invariant image features. His input consists
of the joint angles and angular velocities computed from the motion of the dots in the light displays. The
joint angles and angular velocities are invariant to rotation in the image plane, scale, and translation. A
challenging part of computing these invariants is recovering the connectivity of the individual dots (by body
parts) in the MLD images. A domain-independent approach to this problem is given by Rashid [Rashid, 1980,
O’Rourke and Badler, 1980]. This work considers the computational interpretation of moving light displays,
particularly in the context of gait determination. It emphasized rather high-level symbolic models
of temporal sequences, an approach made possible by the discrete nature of the representation. The results
were quite sensitive to discrete errors and thus highly dependent on the ability to solve the correspondence
problem and accurately track joint and limb positions. This severely limits the general applicability of the
method.

A few studies have considered highly specific aspects of motion recognition computationally. Anderson et
al. [Anderson et al., 1985] describe a method of change detection for surveillance applications based on the
spectral energy in a temporal difference image. This was not generalized to other motion features or more
sophisticated recognition. Pentland [Pentland and Mase, 1989] considered lip reading, and implemented a
system that could recognize spoken digits with 70%-90% accuracy over 5 speakers. The system required
the location of the lips to be entered by hand, and depended on an explicitly constructed lip model. Some
temporal pattern recognition work has been done in the context of speech processing [Juang and Rabiner,
1985, Tank and Hopfield, 1987, Elman, 1988], but the applicability of the techniques to motion recognition
has not been considered.

Finally, there is a body of work based on the analysis of trajectories. Koller, Heinze and Nagel [1991]
developed a system that tracks moving vehicles and characterizes their trajectory segments in terms of natural
language concepts. Gould and Shah [1989] represent motion characteristics of moving objects by recording
the important events in their trajectory. They propose the use of the resulting trajectory primal sketch in a
motion recognition system. Allmen and Dyer have developed a method of extracting spatiotemporal curves
corresponding to moving objects and applied the technique to detection of cyclic motions [Allmen and Dyer,
1990]. Tsai et al. [Tsai et al., 1993] have also worked on cyclic motion detection, using curvature trajectories
to detect cycles by means of Fourier domain techniques. All of the above require the difficult task of robustly
computing the trajectories or spatiotemporal curves from image sequences before attempting recognition,
and the demonstrations of their techniques involve principally synthetic image sequences.

3  Detecting  Activities 

The first step in recognizing an activity is to determine that an activity exists, and localize it in the scene.
In an earlier paper we have described a technique for accomplishing this [Polana and Nelson, 1993]. The
present work will utilize the information computed in the detection stage for recognition and classification
of specific activities.

Activities involve a regularly repeating sequence of motion events. If we consider an image sequence as a
spatiotemporal solid with two spatial dimensions x, y and one time dimension t, then repeated activity tends
to give rise to periodic or semi-periodic gray level signals along smooth curves in the image solid. We refer to
these curves as reference curves. If these curves could be identified and samples extracted along them over
several cycles, then frequency domain techniques could be used in order to judge the degree of periodicity.

To clearly define the reference curves, we need to formalize the concept of a periodic object. An object
is defined as a set of points $P$. Associated with each $p \in P$ is a function $X_p(t)$ giving its location (in a
fixed 3D coordinate system) as a function of time. A stationary periodic object (i.e. a stationary object
exhibiting periodic activity) has the property that $X_p(t) = X_p(t + \tau)$ for all $p \in P$, where $\tau$ is the time
period for one cycle of the activity and is independent of $p$. A slight generalization gives us a definition for
a translating periodic object. Such an object has the property that $X_p(t) = Y_p(t) + Z(t)$, where
$Y_p(t) = Y_p(t + \tau)$ and $Z(t)$ is a path in 3D space independent of $p$. It can be assumed that $Z(0) = 0$ so that
$X_p(0) = Y_p(0)$. Intuitively, a periodic object characterized by $Y_p(t)$ is translated along the path $Z(t)$ (we
are assuming the object does not undergo any rotation and the viewing angle does not change).

If we can determine the translational path of the object by some sort of tracking procedure, then we need
only consider stationary periodic objects, as shown by the equation
$X_p(t) - Z(t) = Y_p(t) = Y_p(t + \tau) = X_p(t + \tau) - Z(t + \tau)$. More formally, corresponding to each point $p$ of a
translating periodic object, we define a 3D-reference curve $R_p(t)$ to be the path $X_p(0) + Z(t)$. We also define a
2D-reference curve $r_p(t)$, corresponding to a point $p$ of the object, to be the projection of $R_p(t)$ onto the
image plane over time (hence $r_p(t)$ is a curve in $(x, y, t)$ space). The gray-level signal along the 2D-reference
curve $r_p(t)$ is determined by the set of points of the object that appear along the 3D-reference curve $R_p(t)$.
It can be shown that the same set of points of the object recur periodically along each reference curve $R_p(t)$.
For example, the point $p$ is on the reference curve $R_p(t)$ at time zero, and it coincides with the reference
curve at regular intervals of $\tau$ (since $X_p(\tau) = Y_p(\tau) + Z(\tau) = Y_p(0) + Z(\tau) = X_p(0) + Z(\tau)$).
Similarly, every other point of the object on the reference curve $R_p(t)$ recurs along $R_p(t)$ at intervals of $\tau$.
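As a rough illustration of why the signal must be sampled along a reference curve rather than at a fixed pixel, the sketch below extracts a signal from an image stack along a straight 2D-reference curve. The function name, the nearest-pixel sampling, and the restriction to a constant image velocity `(vx, vy)` are assumptions made for this example only.

```python
import numpy as np

def signal_along_reference_curve(stack, x0, y0, vx, vy):
    """Sample stack[t] at the pixel nearest (x0 + vx*t, y0 + vy*t) for every frame t.

    stack has shape (frames, rows, cols). The straight-line curve corresponds to
    an actor translating at constant image velocity (an assumption of this sketch).
    """
    n_t, n_y, n_x = stack.shape
    t = np.arange(n_t)
    xs = np.clip(np.rint(x0 + vx * t).astype(int), 0, n_x - 1)
    ys = np.clip(np.rint(y0 + vy * t).astype(int), 0, n_y - 1)
    return stack[t, ys, xs]

# Toy sequence: one point moves right by one pixel per frame while its
# brightness cycles with a period of four frames.
stack = np.zeros((20, 10, 30))
for t in range(20):
    stack[t, 5, 3 + t] = t % 4 + 1

moving = signal_along_reference_curve(stack, x0=3, y0=5, vx=1.0, vy=0.0)
fixed = signal_along_reference_curve(stack, x0=3, y0=5, vx=0.0, vy=0.0)
```

In this toy case `moving` repeats every four frames, while `fixed` sees the point only once as it passes by, so only the signal along the reference curve exposes the periodicity.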

Given an image sequence containing a moving object, the detection scheme works as follows. First,
the object is tracked using a low-level process based on aggregation of moving pixels. The track is used
to generate reference curves, and sample motion signals are extracted along them. Each of the signals is
processed using frequency domain techniques to compute a measure of periodicity. The periodicity measures
of individual signals are combined to produce a periodicity measure for the entire tracked object, which is
then thresholded to decide whether a periodic activity is present in the sequence.
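The tracking stage can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the normal flow magnitude is taken as |I_t| / |grad I|, one common differential formulation, and the function names, the gradient floor `eps`, and the threshold value are assumptions.

```python
import numpy as np

def normal_flow_magnitude(prev, curr, eps=1e-3):
    """Normal flow magnitude |I_t| / |grad I| between two gray-level frames."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    gy, gx = np.gradient(0.5 * (prev + curr))  # spatial derivatives (rows = y, cols = x)
    it = curr - prev                           # temporal derivative
    grad = np.hypot(gx, gy)
    # Normal flow is undefined where the gradient vanishes; report 0 there.
    safe = np.where(grad > eps, grad, 1.0)
    return np.where(grad > eps, np.abs(it) / safe, 0.0)

def motion_centroids(flows, threshold):
    """Centroid (x, y) of the pixels whose flow magnitude exceeds threshold, per frame."""
    cents = []
    for f in flows:
        ys, xs = np.nonzero(f > threshold)
        cents.append((xs.mean(), ys.mean()) if xs.size else (np.nan, np.nan))
    return np.array(cents, dtype=float)

def fit_linear_trajectory(centroids):
    """Least-squares line through the centroids: ((vx, vy), (x0, y0)) in pixels."""
    t = np.arange(len(centroids))
    ok = ~np.isnan(centroids[:, 0])
    vx, x0 = np.polyfit(t[ok], centroids[ok, 0], 1)
    vy, y0 = np.polyfit(t[ok], centroids[ok, 1], 1)
    return (vx, vy), (x0, y0)
```

The fitted velocity defines the direction of the straight reference curves in the spatiotemporal solid; frames with no pixel above the threshold are simply skipped in the fit.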

The  following  is  a  step-by-step  description  of  the  periodic  activity  detection  algorithm: 

•  Input: The input to the algorithm is a digitized 256-level gray-valued image sequence.

•  Output: A periodicity measure indicating the amount of periodicity observed in the image sequence.
This is used to decide whether the image sequence contains a periodic activity and, if so, to locate the
region of the activity.

•  Step  1.  Compute  normal  flow  magnitude  at  each  pixel  between  each  successive  pair  of  frames  using  a 
differential  method. 

•  Step 2. Mark pixels corresponding to significant motion in the scene by thresholding the normal flow
magnitude. Compute the centroid of the marked pixels in each frame. Compute the mean velocity (if
any) of the actor by fitting a linear trajectory to the sequence of centroids. Take reference curves to
be the lines in the spatiotemporal solid parallel to the linear trajectory of centroids of motion. This
simple tracking process is currently adapted to a single actor moving linearly, but is easily extended to
multiple actors and other paths as long as the tracks are smooth and the actors are separated most
of the time.

•  Step 3. Extract motion signals along the reference curves. Compute the dominant frequency and the
periodicity measure for each individual signal extracted. We define the periodicity measure $p_f$ of a
signal $f$ as a normalized difference of the sum of the power spectrum values at the highest amplitude
frequency and its multiples, and the sum of the power spectrum values at the frequencies halfway
between. That is,

$$p_f = \frac{\sum_i F(iw) - \sum_i F(iw + w/2)}{\sum_i F(i)}$$

where $F$ is the energy spectrum of the signal $f$ and $w$ is the frequency corresponding to the highest
amplitude in the energy spectrum.

•  Step 4. For each frequency $w$, assign a value equal to the sum of the periodicity measures from all
the signals whose highest amplitude occurred at that frequency. Compute the overall periodicity measure
$P$ for the image sequence using the formula $P = \max_w (P_w / n_w)$, where $n_w$ and $P_w$ are the number of pixels
at which the highest amplitude frequency is $w$ and the sum of periodicity measures at those pixels,
respectively.
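Steps 3 and 4 can be sketched with a discrete Fourier transform. The code below is an illustrative reading of the description above, under stated assumptions: the signal mean is removed first, the normalizer is taken to be the total spectral energy, and the "halfway" frequencies are rounded down to the nearest FFT bin. The function names are hypothetical.

```python
import numpy as np

def periodicity_measure(signal):
    """Return (p_f, w) for a 1-D signal: periodicity measure and dominant frequency bin."""
    f = np.asarray(signal, dtype=float)
    f = f - f.mean()                          # drop the DC term (assumption)
    spectrum = np.abs(np.fft.rfft(f)) ** 2    # energy spectrum F
    total = spectrum[1:].sum()                # assumed normalizer: total energy
    if total == 0:
        return 0.0, 0
    w = int(np.argmax(spectrum[1:])) + 1      # highest-amplitude frequency bin
    n = len(spectrum)
    harmonics = np.arange(w, n, w)            # w, 2w, 3w, ...
    halfway = harmonics + w // 2              # bins midway between harmonics
    halfway = halfway[halfway < n]
    p = (spectrum[harmonics].sum() - spectrum[halfway].sum()) / total
    return float(p), w

def overall_periodicity(measures, freqs):
    """P = max over w of P_w / n_w, i.e. the mean p_f among signals peaking at w."""
    measures = np.asarray(measures, dtype=float)
    freqs = np.asarray(freqs)
    best_p, best_w = 0.0, None
    for w in np.unique(freqs):
        sel = freqs == w
        p = measures[sel].sum() / sel.sum()
        if best_w is None or p > best_p:
            best_p, best_w = p, int(w)
    return best_p, best_w

t = np.arange(128)
p_sine, w_sine = periodicity_measure(np.sin(2 * np.pi * 8 * t / 128))
```

A pure sinusoid with eight cycles in 128 samples puts all of its energy in bin 8, so its measure is close to 1; energy that falls between the harmonics lowers the measure. The measure is only meaningful when the dominant bin is at least 2, which holds when the sample contains several cycles.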

A more complete discussion of the periodicity detection process and the assumptions made can be found
in the previously cited paper.

4  Recognizing  Activities 

Once an activity has been identified and tracked in a scene, the next step is to recognize it. The tracking and
periodicity detection algorithms provide spatial and temporal normalization that can be used to simplify the
recognition procedure. In particular, recall that the periodicity detection procedure provides a periodicity
measure for each active pixel in a tracked object. By backprojecting this measure, we can locate the pixels in
each frame that display periodicity at the dominant frequency. Since these pixels are likely to belong to the
actor of interest, we can use this backprojection to refine our initial segmentation, which was based solely
on aggregate motion. By fitting a frame to this refined segmentation we compensate for variation in spatial
scale and position. Currently this is done on the assumption that the distance of the actor does not change
significantly over the sample (typically 4 cycles), but a simple change in the frame-fitting procedure can allow
for smooth scale change, as might result during an approaching motion. Similarly, the fundamental frequency
allows us to frame the activity in time, and compensate for variation in temporal scale (i.e. frequency).

The  end  result  of  the  normalization  procedure  is  a  spatio-temporal  solid  containing  the  activity  of  interest 
in  a  form  that  is  invariant  to  spatial  scale,  spatial  translation,  and  temporal  scale.  The  next  step  is  to 
compute  a  descriptor  for  this  solid  that  can  be  used  to  classify  the  activity  it  represents.  This  sounds  like 
a  three  dimensional  template  match,  and  in  fact,  with  the  appropriate  motion  features  in  the  slots  of  the 
template,  such  an  approach  works  well.  Essentially,  we  capitalize  on  the  fact  that  a  periodic  activity  is 
characterized  by  regularly  repeating  motion  events  that  have  fixed  spatial  and  temporal  relationships  to 
each  other. 

In more detail, the process is as follows. We divide one cycle of the spatio-temporal solid representing the
activity into X × Y × T cells by partitioning the two spatial dimensions into X and Y divisions respectively and the
temporal dimension into T divisions. We then select a local motion statistic and compute the same statistic
in each cell of the spatiotemporal grid. The feature vector in this case is composed of XYT elements, each of
which is the value of the statistic in a particular cell: essentially a three dimensional template.
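A minimal sketch of the grid computation, using summed flow magnitude as the local statistic. It assumes the input stack already covers exactly one cycle and has been cropped to the frame fitted around the actor; the function name and the default grid size of 4 x 4 x 6 are choices made for the example.

```python
import numpy as np

def grid_feature_vector(flow_mag, X=4, Y=4, T=6):
    """Sum flow magnitude in each cell of an X x Y x T grid.

    flow_mag has shape (frames, rows, cols). Returns an array of shape (T, Y, X);
    flattening it gives the feature vector of X*Y*T elements.
    """
    n_t, n_y, n_x = flow_mag.shape
    t_edges = np.linspace(0, n_t, T + 1).astype(int)
    y_edges = np.linspace(0, n_y, Y + 1).astype(int)
    x_edges = np.linspace(0, n_x, X + 1).astype(int)
    feat = np.zeros((T, Y, X))
    for k in range(T):
        for j in range(Y):
            for i in range(X):
                cell = flow_mag[t_edges[k]:t_edges[k + 1],
                                y_edges[j]:y_edges[j + 1],
                                x_edges[i]:x_edges[i + 1]]
                feat[k, j, i] = cell.sum()
    return feat

# One cycle of 12 frames at 8 x 8 pixels with unit motion everywhere.
feat = grid_feature_vector(np.ones((12, 8, 8)))
```

Keeping the temporal axis first makes it easy to shift the template in phase later, since a cyclic shift along axis 0 corresponds to a temporal offset within the cycle.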

One issue that affects the measures described above is the fact that so far, the normalized spatio-temporal
solid, while corrected for temporal scale (frequency), is not corrected for temporal translation (phase). There
are a couple of ways to handle this. One is to pick some robust temporal feature to define zero phase, and
normalize all samples with respect to this feature. One feature that works fairly robustly is the time
of maximum difference between total motion in the left and right half fields. Alternatively, since the pattern
matching phase of the algorithm currently represents only a small fraction of the total computational effort,
and the temporal resolution of the pattern is typically small (i.e., less than 10 samples per cycle), we can
simply try a match at each possible phase and pick the best. We have found in our experiments that this
method works better than the first. Hence, the results are reported using this kind of matching only.
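Trying a match at every phase amounts to comparing the reference template against each cyclic shift of the test template along the temporal axis. The sketch below assumes templates of shape (T, Y, X) and uses Euclidean distance as the match score; the distance measure and the function name are assumptions.

```python
import numpy as np

def best_phase_distance(test, reference):
    """Smallest Euclidean distance over all cyclic temporal shifts of test.

    Both arrays have shape (T, Y, X); axis 0 is the phase within one cycle.
    Returns (distance, shift) for the best-matching shift.
    """
    T = test.shape[0]
    dists = [np.linalg.norm(np.roll(test, s, axis=0) - reference) for s in range(T)]
    s_best = int(np.argmin(dists))
    return float(dists[s_best]), s_best

rng = np.random.default_rng(1)
reference = rng.random((6, 4, 4))
shifted = np.roll(reference, 2, axis=0)     # same activity, started two steps later
dist, shift = best_phase_distance(shifted, reference)
```

With T temporal divisions this costs only T distance computations per comparison, which is consistent with the observation that pattern matching is a small fraction of the total effort.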

We experimented with three different local statistics. The first was the dominant motion direction in
each cell. This is approximated by computing the histogram of normal flow directions weighted by the
corresponding normal flow magnitude and selecting the direction with the highest histogram value. The second
statistic was the summed motion magnitude in the dominant motion direction. The third statistic was
simply the summed normal flow magnitude in each cell. The directional information is ignored in this case.
As it turned out, this last statistic, which is in some ways the simplest, worked best.
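The first two statistics can both be read off a magnitude-weighted direction histogram. The following sketch assumes flow directions are given in radians and quantized into eight equal sectors; the function name is hypothetical.

```python
import numpy as np

def dominant_direction(angles, magnitudes, sectors=8):
    """Return (sector index, summed magnitude in that sector) for one cell.

    angles are normal flow directions in radians and magnitudes are the matching
    normal flow magnitudes. The histogram is weighted by magnitude.
    """
    angles = np.mod(np.asarray(angles, dtype=float), 2 * np.pi)
    bins = np.floor(angles / (2 * np.pi / sectors)).astype(int) % sectors
    hist = np.bincount(bins, weights=np.asarray(magnitudes, dtype=float),
                       minlength=sectors)
    k = int(np.argmax(hist))
    return k, float(hist[k])

# Two strong vectors pointing roughly right outweigh three weak ones pointing left.
direction, magnitude = dominant_direction([0.1, 0.2, 3.2, 3.3, 3.25],
                                          [5.0, 1.0, 1.0, 1.0, 1.0])
```

The returned sector index corresponds to the first statistic and the summed magnitude to the second. Both depend on enough flow vectors falling in each cell for the histogram to be stable.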

4.1  Experiments 

We ran experiments on seven different types of activities. The image sequences were first recorded on video
and then digitized with suitable temporal sampling so that at least four cycles of the activity were
captured in 128 frames. Following is a description of each activity and the conditions under which it was
digitized.

•  Walk:  A  person  walking  on  a  treadmill. 

•  Exercise:  A  person  exercising  on  a  machine.

•  Jump:  A  person  performing  jumping  jacks.

•  Swing:  A  person  swinging  viewed  from  the  side.

•  Run:  A  person  running  on  a  treadmill.

•  Ski:  A  person  skiing  on  a  skiing  machine.

•  Frog:  A  toy  frog  simulating  swimming  activity  viewed  from  above.

All samples were digitized at a spatial resolution of 128x128 pixels, except those for walk and run, which
were digitized at a resolution of 64x128 pixels. Pixels were 8 bit gray levels. The swing and exercise activities
were shot outdoors and contained background motion.

We first digitized eight samples of each activity by the same person under the same conditions with respect
to scene illumination, background, and camera position. We created the reference database by taking half of the
samples belonging to each activity. In other words, the reference database consists of four samples of each
of the seven activities. Sample images of these activities are shown in Figure 1. The remaining four samples
of each activity are used to create the test database. In addition, we digitized four samples of walking by a
different person and eight samples of the frog under different lighting conditions and different background
and foreground gradients. These samples also differed from the reference database in frequency, speed of
motion, and spatial scale. Examples of these samples are shown in Figure 2. These samples were added to the
test database. The samples in the test database were classified by a nearest centroid classification technique
using the samples in the reference database as the training set.
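A nearest centroid classifier reduces each class in the reference database to the mean of its feature vectors and assigns a test sample to the class with the closest mean. The sketch below uses plain Euclidean distance and toy two-element feature vectors; the function names and the data are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def class_centroids(features, labels):
    """Mean feature vector of each class in the reference database."""
    features = np.asarray(features, dtype=float)
    return {c: features[[l == c for l in labels]].mean(axis=0)
            for c in sorted(set(labels))}

def classify(sample, centroids):
    """Label of the centroid nearest to sample in Euclidean distance."""
    sample = np.asarray(sample, dtype=float)
    return min(centroids, key=lambda c: np.linalg.norm(sample - centroids[c]))

# Toy reference database with two samples per class.
reference = [[0.0, 1.0], [1.0, 0.0], [10.0, 9.0], [9.0, 10.0]]
labels = ["walk", "walk", "run", "run"]
centroids = class_centroids(reference, labels)
predicted = classify([2.0, 1.0], centroids)
```

For the phase-tolerant matching used in the experiments, the Euclidean distance in `classify` would be replaced by the minimum distance over all temporal offsets.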

We conducted experiments using the three local motion statistics described above. In each case the feature
vector consists of the local statistic computed over each of a set of cells constituting a partition of the
spatio-temporal solid. We divided each spatial dimension into four divisions and the temporal dimension into six
divisions, so that we get a feature vector of length 96. To reiterate, the three local statistics were: direction of
maximum motion (where the directions are quantized into eight sectors), the motion magnitude in the maximum
motion direction, and total motion magnitude in each cell. Sample feature vectors are illustrated in Figure 3 using
the total motion magnitude statistic for a walk and a run sequence.

We initially computed the feature vectors by finding a zero phase marker within a cycle using the method
described previously. However, more reliable results were achieved by matching each test feature vector with
the reference feature vector six times, corresponding to different temporal offsets, and choosing the best
match obtained. The results reported below utilize the latter method. This classification resulted in correct
classification of every sample in the test database, including the samples using a different actor and different
backgrounds, which were not represented in the reference database.

The results of classification using different variations are shown in terms of percentage of test cases
correctly classified in Table 1. Somewhat to our surprise, the simplest statistic, the total motion magnitude,
gave better results than either of the statistics involving direction of motion. The reason for this turned
out to be related to the resolution of our images. In order to digitize enough frames to test the technique,
we had subsampled the images to 128 x 128 pixels. After filtering for periodicity, significant motion, and
direction, it was often the case that few pixels with all these properties were left in any one cell, which made
for a large amount of stochastic noise in the signal. Simply put, we didn’t have high enough resolution data
to appropriately utilize the more specific statistics.

The percentage of correct classification does not give a full indication of the quality of classification.
Hence, we also illustrate the results by the confusion matrix, which shows how closely test samples belonging
to various classes match the reference samples of the different classes. The confusion matrix using the total
motion magnitude statistic is shown in Figure 4. A large square indicates a good match. As can be seen from
this figure, some motions, for instance the swimming frog, do not resemble anything else in the database,
while others, for instance running and skiing, are more likely to be confused. The results seem to correspond
more or less to human intuition about how similar the motions are.


Feature vector                   Total test   Correctly    Percent   Failures
                                 samples      classified   success
direction of maximal motion      40           32           80        walk by different actor and
                                                                     frog under different gradients
magnitude in maximal direction   40           39           97.5      walk by different actor
total motion magnitude           40           40           100       None

Table 1: Classification results


5  Discussion 

The  following  is  a  step-by-step  description  of  the  periodic  activity  recognition  algorithm: 

•  Input: The input to the algorithm is a digitized 256-level gray-valued image sequence consisting of at
least four cycles of a periodic activity.

•  Output: A known class into which the activity is classified by the algorithm.

•  Step 1. Compute normal flow magnitude at each pixel between each successive pair of frames using a
differential method.

•  Step 2. Locate and track the activity in the image sequence using the periodicity detection algorithm
described in Section 3.

•  Step 3. Normalize the activity using pixels exhibiting periodic motion and compute a feature vector.

•  Step 4. Classify the activity using the nearest centroid algorithm.

The method we have described displays several desirable invariances. It is robust to varying image
illumination and contrast because the method uses only motion information, which is invariant to these. It is
also invariant to spatial and temporal translation and scale due to the normalization of the feature vectors
and the multiple temporal matching. It is also fairly robust with respect to small changes in viewing angle.
The swing and exercise sequences were taken outdoors where there is a small amount of background motion.
This comprises not only moving trees and plants, but also moving people and an occasional crossing of a car.
That the activities can be detected even in this case demonstrates that the technique is somewhat tolerant
of background clutter and the occasional disturbance.

To understand how much background clutter can be tolerated by this technique, we have experimented
with the walk samples by adding motion clutter produced by blowing leaves. This structured motion clutter
is added in a controlled fashion so that its mean magnitude represents a varying percentage of the mean
magnitude of the signal, and the resulting samples are classified using the total motion magnitude statistic.
The results are tabulated in Table 2. The results show that the recognition scheme can tolerate motion clutter
whose magnitude is equal to one half that of the activity, and it displays degraded but still useful performance
for even higher clutter magnitudes.

We have assumed that the actors giving rise to the activity move with constant velocity along linear
paths. The case of nonlinearly moving objects can be handled by tracking the object of interest given a
coarse estimate of its initial location and velocity (e.g. with a Kalman filter). This would generate reference
curves that are not straight lines. We have already demonstrated the usefulness of the centroid of motion
for computing the velocity of linearly moving objects, and providing a rough initial segmentation. It could
also be used for tracking actors moving on more complex trajectories. Use of the motion centroid can
be unreliable in estimating the centroid of the object if the shape of the object changes as it moves. In this
case use of a prediction and correction mechanism using past values over a sufficiently long period can help.
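One possible form of such a prediction and correction mechanism is a constant-velocity Kalman filter driven by the per-frame motion centroids. The sketch below is illustrative only: the state layout, the noise variances `q` and `r`, and the initial covariance are assumed values, not parameters from this work.

```python
import numpy as np

def kalman_track(measurements, q=1e-2, r=1.0):
    """Constant-velocity Kalman filter over (x, y) centroid measurements.

    The state is [x, y, vx, vy]. q and r are assumed process and measurement
    noise variances. Returns the filtered positions and the final state.
    """
    F = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # one-frame state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # only position is observed
    Q = q * np.eye(4)
    R = r * np.eye(2)
    x = np.array([measurements[0][0], measurements[0][1], 0.0, 0.0])
    P = 10.0 * np.eye(4)                        # loose initial uncertainty
    track = [x[:2].copy()]
    for z in measurements[1:]:
        # Predict the next state from the motion model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct the prediction with the measured centroid.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.asarray(z, dtype=float) - H @ x)
        P = (np.eye(4) - K @ H) @ P
        track.append(x[:2].copy())
    return np.array(track), x

# Centroids of an actor moving 2 pixels right and 1 pixel down per frame.
measurements = [(2.0 * t, 1.0 * t) for t in range(50)]
track, state = kalman_track(measurements)
```

Because each estimate blends the prediction with the new measurement, a centroid that jitters as the actor changes shape is smoothed, and the filtered positions can serve as a reference curve that need not be a straight line.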

The detection scheme also assumes that there is only one activity in the scene except for some background
clutter. If there are multiple activities in the scene, this detection technique can still be applied provided
the activities can be spatially isolated so that they do not interfere with each other. In this case they
can be segmented using the motion information and tracked separately. If a predictive tracker is used, an
occasional crossing of different activities can be tolerated as long as the regions can be separated again
later. In our experiments, the periodic activity samples consist of at least four cycles of the activity. Four
cycles were needed to reliably detect the fundamental frequency, given that there is a considerable amount
of non-repetitive structure from the background in the case of translating actors.

The complexity of recognition is proportional to the number of pixels involved in the activity. More than
half the work is computing the motion vectors at every pixel and then computing the fast Fourier transforms
at each of the moving pixels. The remaining time is spent computing the feature vector, the time for which
depends on the local motion statistic computed. For a 128-frame image sequence, computation of the feature
vector of motion magnitudes takes about 3 seconds. The classification algorithm currently runs on an SGI
machine using four processors, and it takes at most 20 seconds to process a 128-frame sequence of 128x128
images.

6  Conclusion 

We have described a general technique for periodic activity recognition. This technique uses a periodicity
measure to detect the activity and then a feature vector based on motion information to classify the activity
into one of several known classes. We have illustrated the technique using real-world examples of activities,
and shown that it robustly recognizes complex periodic activities.

Figure 2: Sample images of walk by a different actor and toy frog under different background and frequency

Figure 3: Sample total motion magnitude feature vector for a sample of walk (top) and a sample of run
(bottom). One cycle of activity is divided into six time divisions shown horizontally; each frame shows the spatial
distribution of motion in a 4x4 spatial grid (the size of each square is proportional to the amount of motion in
the neighborhood).